Space-Time Prediction of Bike Share Demand: Philadelphia Indego

Authors

Xinyuan (Christine) Cui, Yuqing (Demi) Yang

Published

December 9, 2025


PART 1: Compare 2024 Q3 with 2025 Q1

1. Setup

1.1 Load libraries

Code
# Core tidyverse
library(tidyverse)
library(lubridate)

# Spatial data
library(sf)
library(tigris)

# Census data
library(tidycensus)

# Weather data
library(riem)  # For Philadelphia weather from ASOS stations

# Visualization
library(viridis)
library(gridExtra)
library(knitr)
library(kableExtra)
library(ggplot2)

# here!
library(here)
# Get rid of scientific notation. We gotta look good!
options(scipen = 999)

# Use English date labels in plots ("English" is a Windows-specific locale
# name; it returns "" on macOS/Linux, where "en_US.UTF-8" is the equivalent)
Sys.setlocale("LC_TIME", "English")
[1] ""

1.2 Define Themes

Code
plotTheme <- theme(
  plot.title = element_text(size = 14, face = "bold"),
  plot.subtitle = element_text(size = 10),
  plot.caption = element_text(size = 8),
  axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
  axis.text.y = element_text(size = 10),
  axis.title = element_text(size = 11, face = "bold"),
  panel.background = element_blank(),
  panel.grid.major = element_line(colour = "#D0D0D0", linewidth = 0.2),
  panel.grid.minor = element_blank(),
  axis.ticks = element_blank(),
  legend.position = "right"
)

mapTheme <- theme(
  plot.title = element_text(size = 14, face = "bold"),
  plot.subtitle = element_text(size = 10),
  plot.caption = element_text(size = 8),
  axis.line = element_blank(),
  axis.text = element_blank(),
  axis.ticks = element_blank(),
  axis.title = element_blank(),
  panel.background = element_blank(),
  panel.border = element_blank(),
  panel.grid.major = element_line(colour = 'transparent'),
  panel.grid.minor = element_blank(),
  legend.position = "right",
  plot.margin = margin(1, 1, 1, 1, 'cm'),
  legend.key.height = unit(1, "cm"),
  legend.key.width = unit(0.2, "cm")
)

palette5 <- c("#eff3ff", "#bdd7e7", "#6baed6", "#3182bd", "#08519c")

1.3 Set Census API Key

Code
# Use your own key from https://api.census.gov/data/key_signup.html --
# never publish a real API key in a rendered document
census_api_key("YOUR_CENSUS_API_KEY", overwrite = TRUE, install = TRUE)

2. Data Import & Preparation

2.1 Load Indego Trip Data

Code
# Read Q3 2024 data
indego_q3 <- read_csv(here("assignments/assignment_5/indego-trips-2024-q3.csv"))
# Read Q1 2025 data
indego_q1 <- read_csv(here("assignments/assignment_5/indego-trips-2025-q1.csv"))

Why Q3?

  • Extreme Seasonal Variation: Pairing Q3 with Q1 lets us compare the “peak season” (summer) against the “off-peak season” (winter). Q3 covers the warmest months, with ideal biking conditions and high leisure activity.
  • Model Generalizability: Training models on two drastically different quarters tests whether our predictive features (such as spatial lags or rush-hour indicators) remain robust across different volume levels.
  • Behavioral Shifts: Summer ridership is more diverse, likely mixing commuters, tourists, and recreational riders, whereas winter data isolates the “core user” base—resilient commuters who rely on Indego regardless of the weather.

2.2 Examine the Data Structure

Code
indego_q3$start_time_obj <- mdy_hm(indego_q3$start_time)
indego_q1$start_time_obj <- mdy_hm(indego_q1$start_time)

comparison_df <- data.frame(
  Metric = c(
    "Total Trips", 
    "Date Range", 
    "Unique Start Stations",
    "Trip: One Way",
    "Trip: Round Trip",
    "Bike: Standard",
    "Bike: Electric",
    "Passholder: Day Pass",
    "Passholder: Walk-up"
  ),
  
  Q3_2024 = c(
    nrow(indego_q3),
    paste(format(min(indego_q3$start_time_obj), "%Y-%m-%d"), "to", format(max(indego_q3$start_time_obj), "%Y-%m-%d")),
    length(unique(indego_q3$start_station)),
    sum(indego_q3$trip_route_category == "One Way", na.rm = TRUE),
    sum(indego_q3$trip_route_category == "Round Trip", na.rm = TRUE),
    sum(indego_q3$bike_type == "standard", na.rm = TRUE),
    sum(indego_q3$bike_type == "electric", na.rm = TRUE),
    sum(indego_q3$passholder_type == "Day Pass", na.rm = TRUE),
    sum(indego_q3$passholder_type == "Walk-up", na.rm = TRUE)
  ),
  
  Q1_2025 = c(
    nrow(indego_q1),
    paste(format(min(indego_q1$start_time_obj), "%Y-%m-%d"), "to", format(max(indego_q1$start_time_obj), "%Y-%m-%d")),
    length(unique(indego_q1$start_station)),
    sum(indego_q1$trip_route_category == "One Way", na.rm = TRUE),
    sum(indego_q1$trip_route_category == "Round Trip", na.rm = TRUE),
    sum(indego_q1$bike_type == "standard", na.rm = TRUE),
    sum(indego_q1$bike_type == "electric", na.rm = TRUE),
    sum(indego_q1$passholder_type == "Day Pass", na.rm = TRUE),
    sum(indego_q1$passholder_type == "Walk-up", na.rm = TRUE)
  )
)

kable(comparison_df, caption = "Comparison of Indego Q3 2024 vs Q1 2025")
Comparison of Indego Q3 2024 vs Q1 2025
Metric Q3_2024 Q1_2025
Total Trips 408408 201588
Date Range 2024-07-01 to 2024-09-30 2025-01-01 to 2025-03-31
Unique Start Stations 261 265
Trip: One Way 380939 190792
Trip: Round Trip 27469 10796
Bike: Standard 171569 72027
Bike: Electric 236839 129561
Passholder: Day Pass 22885 5494
Passholder: Walk-up 13308 10419
  • The system exhibits a sharp seasonal contraction, with total ridership dropping by roughly 50% (408,408 to 201,588 trips). The decline is disproportionately driven by the loss of leisure and casual users, evidenced by a 76% drop in Day Pass usage (22,885 to 5,494) and a sharp reduction in Round Trips. While the physical network remained stable (growing slightly from 261 to 265 stations), electric bikes saw higher usage than standard bikes in both seasons—a persistent user preference, though one that may also reflect the composition of Indego’s fleet.
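
The percentage drops quoted above can be reproduced directly from the table totals; a minimal base-R sketch (the `pct_change` helper is illustrative, not part of the analysis pipeline):

```r
# Percent change between two quarterly counts (hypothetical helper)
pct_change <- function(from, to) round((to - from) / from * 100, 1)

# Totals taken from the comparison table above
total_drop    <- pct_change(408408, 201588)  # Total trips: about -50%
day_pass_drop <- pct_change(22885, 5494)     # Day Pass trips: about -76%
```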

2.3 Create Time Bins

Code
panel_q3_base <- indego_q3 %>%
  mutate(
    # Parse datetime
    start_datetime = mdy_hm(start_time),
    end_datetime = mdy_hm(end_time),
    
    # Create hourly bins
    interval60 = floor_date(start_datetime, unit = "hour"),
    
    # Extract time features
    week = week(interval60),
    month = month(interval60, label = TRUE), 
    dotw = wday(interval60, label = TRUE),
    hour = hour(interval60),
    date = as.Date(interval60),
    
    # Create useful indicators
    weekend = ifelse(dotw %in% c("Sat", "Sun"), 1, 0),
    rush_hour = ifelse(hour %in% c(7, 8, 9, 16, 17, 18), 1, 0)
  )


# Look at temporal features
head(panel_q3_base %>% select(start_datetime, interval60, week, dotw, hour, weekend))
# A tibble: 6 × 6
  start_datetime      interval60           week dotw   hour weekend
  <dttm>              <dttm>              <dbl> <ord> <int>   <dbl>
1 2024-07-01 00:02:00 2024-07-01 00:00:00    27 Mon       0       0
2 2024-07-01 00:03:00 2024-07-01 00:00:00    27 Mon       0       0
3 2024-07-01 00:04:00 2024-07-01 00:00:00    27 Mon       0       0
4 2024-07-01 00:05:00 2024-07-01 00:00:00    27 Mon       0       0
5 2024-07-01 00:06:00 2024-07-01 00:00:00    27 Mon       0       0
6 2024-07-01 00:06:00 2024-07-01 00:00:00    27 Mon       0       0
Code
cat("2024 Q3 Base Panel:", format(nrow(panel_q3_base), big.mark = ","), "rows\n")
2024 Q3 Base Panel: 408,408 rows
Code
panel_q1_base <- indego_q1 %>%
  mutate(
    # Parse datetime
    start_datetime = mdy_hm(start_time),
    end_datetime = mdy_hm(end_time),
    
    # Create hourly bins
    interval60 = floor_date(start_datetime, unit = "hour"),
    
    # Extract time features
    week = week(interval60),
    month = month(interval60, label = TRUE, locale = "en_US.UTF-8"),
    dotw = wday(interval60, label = TRUE, locale = "en_US.UTF-8"),
    hour = hour(interval60),
    date = as.Date(interval60),
    
    # Create useful indicators
    weekend = ifelse(dotw %in% c("Sat", "Sun"), 1, 0),
    rush_hour = ifelse(hour %in% c(7, 8, 9, 16, 17, 18), 1, 0)
  )
# Look at temporal features
head(panel_q1_base %>% select(start_datetime, interval60, week, dotw, hour, weekend))
# A tibble: 6 × 6
  start_datetime      interval60           week dotw   hour weekend
  <dttm>              <dttm>              <dbl> <ord> <int>   <dbl>
1 2025-01-01 00:00:00 2025-01-01 00:00:00     1 Wed       0       0
2 2025-01-01 00:04:00 2025-01-01 00:00:00     1 Wed       0       0
3 2025-01-01 00:05:00 2025-01-01 00:00:00     1 Wed       0       0
4 2025-01-01 00:05:00 2025-01-01 00:00:00     1 Wed       0       0
5 2025-01-01 00:08:00 2025-01-01 00:00:00     1 Wed       0       0
6 2025-01-01 00:14:00 2025-01-01 00:00:00     1 Wed       0       0
Code
cat("2025 Q1 Base Panel:", format(nrow(panel_q1_base), big.mark = ","), "rows\n")
2025 Q1 Base Panel: 201,588 rows

3. Exploratory Analysis

3.1 Daily Patterns

Code
# Daily trip counts
daily_trips_q3 <- panel_q3_base %>%
  group_by(date) %>%
  summarize(trips = n())

daily_trips_q3_plot <- ggplot(daily_trips_q3, aes(x = date, y = trips)) +
  geom_line(color = "#3182bd", linewidth = 1) +
  geom_smooth(se = FALSE, color = "red", linetype = "dashed") +
  labs(
    title = "Indego Daily Ridership - Q3 2024",
    subtitle = "Summer demand patterns in Philadelphia",
    x = "Date",
    y = "Daily Trips",
    caption = "Source: Indego bike share"
  ) +
  plotTheme

daily_trips_q1 <- panel_q1_base %>%
  group_by(date) %>%
  summarize(trips = n())

daily_trips_q1_plot <- ggplot(daily_trips_q1, aes(x = date, y = trips)) +
  geom_line(color = "#3182bd", linewidth = 1) +
  geom_smooth(se = FALSE, color = "red", linetype = "dashed") +
  labs(
    title = "Indego Daily Ridership - Q1 2025",
    subtitle = "Winter demand patterns in Philadelphia",
    x = "Date",
    y = "Daily Trips",
    caption = "Source: Indego bike share"
  ) +
  plotTheme

library(patchwork)
daily_trips_q3_plot / daily_trips_q1_plot

How does ridership change over time?

We find that seasonal influence is the dominant driver of ridership patterns, with distinct behaviors in each quarter:

  • Q3 2024 (Summer) shows consistently high demand, fluctuating between 3,000 and 5,000 daily trips, whereas Q1 2025 (Winter) sees significantly lower activity, ranging from 1,000 to 3,000 trips.

  • Summer ridership remains relatively stable and flat across July through September. In contrast, Winter ridership exhibits a clear upward trend, starting low in January and gradually increasing as the season transitions toward spring in March.

  • Summer displays a regular, rhythmic cycle (likely weekly commute patterns), while Winter demand is highly volatile with sharp drops, likely reflecting greater sensitivity to extreme cold or precipitation events.
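
The volatility claim can be quantified with the coefficient of variation (sd / mean) of daily trips. A self-contained sketch with synthetic stand-ins for the two quarters (on the real data this would run on daily_trips_q3 and daily_trips_q1):

```r
# Synthetic daily series mimicking the two regimes described above
set.seed(42)
summer_daily <- 4000 + rnorm(92, sd = 300)  # stable summer demand
winter_daily <- 2000 + rnorm(90, sd = 500)  # volatile winter demand

# Coefficient of variation: relative volatility, comparable across levels
cv <- function(x) sd(x) / mean(x)
cv(summer_daily)  # roughly 0.07
cv(winter_daily)  # roughly 0.25
```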

3.2 Hourly Patterns

Code
# Average trips by hour and day type
hourly_patterns_q3 <- panel_q3_base %>%
  group_by(hour, weekend) %>%
  summarize(avg_trips = n() / n_distinct(date), .groups = "drop") %>%
  mutate(day_type = ifelse(weekend == 1, "Weekend", "Weekday"))

hourly_trips_q3_plot <- ggplot(hourly_patterns_q3, aes(x = hour, y = avg_trips, color = day_type)) +
  geom_line(linewidth = 1.2) +
  scale_color_manual(values = c("Weekday" = "#08519c", "Weekend" = "#6baed6")) +
  labs(
    title = "Average Hourly Ridership Patterns in 2024 Q3",
    subtitle = "Clear commute patterns on weekdays",
    x = "Hour of Day",
    y = "Average Trips per Hour",
    color = "Day Type"
  ) +
  plotTheme

hourly_patterns_q1 <- panel_q1_base %>%
  group_by(hour, weekend) %>%
  summarize(avg_trips = n() / n_distinct(date), .groups = "drop") %>%
  mutate(day_type = ifelse(weekend == 1, "Weekend", "Weekday"))

hourly_trips_q1_plot <- ggplot(hourly_patterns_q1, aes(x = hour, y = avg_trips, color = day_type)) +
  geom_line(linewidth = 1.2) +
  scale_color_manual(values = c("Weekday" = "#08519c", "Weekend" = "#6baed6")) +
  labs(
    title = "Average Hourly Ridership Patterns in 2025 Q1",
    subtitle = "Clear commute patterns on weekdays",
    x = "Hour of Day",
    y = "Average Trips per Hour",
    color = "Day Type"
  ) +
  plotTheme

hourly_trips_q3_plot / hourly_trips_q1_plot

When are the peak hours? How do weekends differ from weekdays?

Based on the hourly ridership charts provided, here is the analysis of peak hours and day-type differences:

  • Weekdays show a distinct “commuter profile” with two sharp peaks: a morning peak around 8:00 AM and an evening peak—consistently higher—around 5:00 PM (17:00). Weekends, instead of sharp spikes, show a single broad plateau, with demand peaking in the mid-afternoon, roughly between 12:00 PM and 4:00 PM.

  • Weekdays exhibit a two-peak distribution driven by work commutes, whereas weekends follow a bell-shaped curve consistent with leisure or errand activity. In summer, weekend demand is robust, with the afternoon plateau hovering near 250–300 trips per hour, nearly matching weekday midday levels.
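
The same peak-finding logic applies to hourly_patterns_q3 above; as a self-contained sketch on a toy 24-hour profile (the numbers are hypothetical, shaped to have peaks at 8:00 and 17:00):

```r
# Toy weekday-style hourly profile (not the real Indego data)
toy_profile <- data.frame(
  hour = 0:23,
  avg_trips = c(5, 3, 2, 2, 3, 8, 25, 60, 90, 50, 40, 45,
                55, 50, 45, 55, 80, 110, 70, 40, 30, 20, 12, 8)
)

# Evening peak is the global maximum; morning peak is the pre-noon maximum
evening_peak <- toy_profile$hour[which.max(toy_profile$avg_trips)]
am <- subset(toy_profile, hour < 12)
morning_peak <- am$hour[which.max(am$avg_trips)]
```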


4. Get Philadelphia Spatial Context

4.1 Load Philadelphia Census Data

Code
# Get Philadelphia census tracts
philly_census <- get_acs(
  geography = "tract",
  variables = c(
    "B01003_001",  # Total population
    "B19013_001",  # Median household income
    "B08301_001",  # Total commuters
    "B08301_010",  # Commute by transit
    "B02001_002",  # White alone
    "B25077_001"   # Median home value
  ),
  state = "PA",
  county = "Philadelphia",
  year = 2022,
  geometry = TRUE,
  output = "wide"
) %>%
  rename(
    Total_Pop = B01003_001E,
    Med_Inc = B19013_001E,
    Total_Commuters = B08301_001E,
    Transit_Commuters = B08301_010E,
    White_Pop = B02001_002E,
    Med_Home_Value = B25077_001E
  ) %>%
  mutate(
    Percent_Taking_Transit = (Transit_Commuters / Total_Commuters) * 100,
    Percent_White = (White_Pop / Total_Pop) * 100
  ) %>%
  st_transform(crs = 4326)

4.2 Map Philadelphia Context

Code
# Map median income
ggplot() +
  geom_sf(data = philly_census, aes(fill = Med_Inc), color = NA) +
  scale_fill_viridis(
    option = "viridis",
    name = "Median\nIncome",
    labels = scales::dollar
  ) +
  labs(
    title = "Philadelphia Median Household Income by Census Tract",
    subtitle = "Context for understanding bike share demand patterns"
  ) +
  # Stations 
  geom_point(
    data = indego_q3,
    aes(x = start_lon, y = start_lat),
    color = "red", size = 0.25, alpha = 0.6
  ) +
  mapTheme

  • The map reveals that Indego stations are heavily clustered in Philadelphia’s core—Center City and University City—which corresponds to the highest median household income tracts. Network density drops precipitously in the lower-income neighborhoods of North and West Philadelphia, highlighting a spatial disparity: bike share infrastructure is significantly more accessible to wealthier residents than to those in economically disadvantaged areas.

4.3 Join Census Data to Stations

Code
# Create sf object for stations
stations_sf_q3 <- panel_q3_base %>%
  distinct(start_station, start_lat, start_lon) %>%
  filter(!is.na(start_lat), !is.na(start_lon)) %>%
  st_as_sf(coords = c("start_lon", "start_lat"), crs = 4326)

stations_sf_q1 <- panel_q1_base %>%
  distinct(start_station, start_lat, start_lon) %>%
  filter(!is.na(start_lat), !is.na(start_lon)) %>%
  st_as_sf(coords = c("start_lon", "start_lat"), crs = 4326)

# Spatial join to get census tract for each station
stations_census_q3 <- st_join(stations_sf_q3, philly_census, left = TRUE) %>%
  st_drop_geometry()

stations_census_q1 <- st_join(stations_sf_q1, philly_census, left = TRUE) %>%
  st_drop_geometry()

# Look at the result - investigate whether all of the stations joined to census
# data; according to the map above, some stations sit in non-residential tracts.
# (stations_for_map is built below, just before the diagnostic map.)

# Add back to trip data
indego_census_q3 <- panel_q3_base %>%
  left_join(
    stations_census_q3 %>% 
      select(start_station, Med_Inc, Percent_Taking_Transit, 
             Percent_White, Total_Pop),
    by = "start_station"
  )

indego_census_q1 <- panel_q1_base %>%
  left_join(
    stations_census_q1 %>% 
      select(start_station, Med_Inc, Percent_Taking_Transit, 
             Percent_White, Total_Pop),
    by = "start_station"
  )

# Prepare data for visualization
stations_for_map <- indego_q3 %>%
  distinct(start_station, start_lat, start_lon) %>%
  filter(!is.na(start_lat), !is.na(start_lon)) %>%
  left_join(
    stations_census_q3 %>% select(start_station, Med_Inc),
    by = "start_station"
  ) %>%
  mutate(has_census = !is.na(Med_Inc))

# Create the map showing problem stations
ggplot() +
  geom_sf(data = philly_census, aes(fill = Med_Inc), color = "white", size = 0.1) +
  scale_fill_viridis(
    option = "viridis",
    name = "Median\nIncome",
    labels = scales::dollar,
    na.value = "grey90"
  ) +
  # Stations with census data (small grey dots)
  geom_point(
    data = stations_for_map %>% filter(has_census),
    aes(x = start_lon, y = start_lat),
    color = "grey30", size = 1, alpha = 0.6
  ) +
  # Stations WITHOUT census data (red X marks the spot)
  geom_point(
    data = stations_for_map %>% filter(!has_census),
    aes(x = start_lon, y = start_lat),
    color = "red", size = 1, shape = 4, stroke = 1.5
  ) +
  labs(
    title = "Philadelphia Median Household Income by Census Tract",
    subtitle = "Indego stations shown (RED = no census data match)",
    caption = "Red X marks indicate stations that didn't join to census tracts"
  ) +
  mapTheme

  • The code performs a spatial data validation check. Its main logic is to geometrically link every Indego bike station to its underlying US Census tract to inherit demographic variables (like Median Household Income).

  • The map acts as a diagnostic tool where the Red X marks reveal a “missing data” problem. These specific stations failed to match with a census tract, mostly because they are located on waterfront piers or inside parks/industrial zones—areas that are non-residential and therefore have no recorded “Median Household Income”. These stations must be excluded from the predictive model because they lack the necessary demographic input data.
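
A minimal, self-contained illustration of that failure mode, using sf with hypothetical geometries (one unit-square “tract” and two made-up stations):

```r
library(sf)

# One hypothetical census "tract" (a unit square) with an income attribute
tract <- st_sf(
  Med_Inc = 52000,
  geometry = st_sfc(st_polygon(list(rbind(
    c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0)
  ))))
)

# Two hypothetical stations: one inside the tract, one on a "pier" outside it
stations <- st_sf(
  station = c("inside", "pier"),
  geometry = st_sfc(st_point(c(0.5, 0.5)), st_point(c(2, 2)))
)

# A left st_join keeps both stations; the pier station gets NA for Med_Inc —
# exactly the failure mode flagged by the red X marks above
joined <- st_join(stations, tract, left = TRUE)
```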

4.4 Dealing with missing data

Code
# Identify which stations to keep
valid_stations_q3 <- stations_census_q3 %>%
  filter(!is.na(Med_Inc)) %>%
  pull(start_station)

valid_stations_q1 <- stations_census_q1 %>%
  filter(!is.na(Med_Inc)) %>%
  pull(start_station)

# Filter trip data to valid stations only
indego_census_q3 <- panel_q3_base %>%
  filter(start_station %in% valid_stations_q3) %>%
  left_join(
    stations_census_q3 %>% 
      select(start_station, Med_Inc, Percent_Taking_Transit, 
             Percent_White, Total_Pop),
    by = "start_station"
  )

indego_census_q1 <- panel_q1_base %>%
  filter(start_station %in% valid_stations_q1) %>%
  left_join(
    stations_census_q1 %>% 
      select(start_station, Med_Inc, Percent_Taking_Transit, 
             Percent_White, Total_Pop),
    by = "start_station"
  )

5. Get Weather Data

5.1 Load Data

Code
# Get weather from Philadelphia International Airport (KPHL)
# This covers Q3 2024: July 1 - September 30
weather_data_q3 <- riem_measures(
  station = "PHL",  # Philadelphia International Airport
  date_start = "2024-07-01",
  date_end = "2024-09-30"
)

# Process weather data: aggregate to hourly to avoid duplicate rows
weather_processed_q3 <- weather_data_q3 %>%
  mutate(
    interval60 = floor_date(valid, unit = "hour"),
    Temperature = tmpf,  # Temperature in Fahrenheit
    Precipitation = ifelse(is.na(p01i), 0, p01i),  # Hourly precip in inches
    Wind_Speed = sknt  # Wind speed in knots
  ) %>%
  # Use summarize (not distinct) to combine multiple readings per hour
  group_by(interval60) %>%
  summarize(
    Temperature = mean(Temperature, na.rm = TRUE),
    Precipitation = sum(Precipitation, na.rm = TRUE),
    Wind_Speed = mean(Wind_Speed, na.rm = TRUE)
  ) %>%
  ungroup()

# Check for missing hours and fill gaps by carrying the last observation forward
weather_complete_q3 <- weather_processed_q3 %>%
  complete(interval60 = seq(min(interval60), max(interval60), by = "hour")) %>%
  fill(Temperature, Precipitation, Wind_Speed, .direction = "down")

# Look at the weather
summary(weather_complete_q3 %>% select(Temperature, Precipitation, Wind_Speed))
  Temperature    Precipitation        Wind_Speed    
 Min.   :55.00   Min.   :0.000000   Min.   : 0.000  
 1st Qu.:70.00   1st Qu.:0.000000   1st Qu.: 4.000  
 Median :76.00   Median :0.000000   Median : 6.667  
 Mean   :75.96   Mean   :0.009819   Mean   : 6.670  
 3rd Qu.:81.00   3rd Qu.:0.000000   3rd Qu.: 9.000  
 Max.   :98.00   Max.   :2.750100   Max.   :24.000  
Code
# This covers Q1 2025: Jan 1 - Mar 31
weather_data_q1 <- riem_measures(
  station = "PHL",  # Philadelphia International Airport
  date_start = "2025-01-01",
  date_end = "2025-03-31"
)

# Process weather data: aggregate to hourly to avoid duplicate rows
weather_processed_q1 <- weather_data_q1 %>%
  mutate(
    interval60 = floor_date(valid, unit = "hour"),
    Temperature = tmpf,  # Temperature in Fahrenheit
    Precipitation = ifelse(is.na(p01i), 0, p01i),  # Hourly precip in inches
    Wind_Speed = sknt  # Wind speed in knots
  ) %>%
  # Use summarize (not distinct) to combine multiple readings per hour
  group_by(interval60) %>%
  summarize(
    Temperature = mean(Temperature, na.rm = TRUE),
    Precipitation = sum(Precipitation, na.rm = TRUE),
    Wind_Speed = mean(Wind_Speed, na.rm = TRUE)
  ) %>%
  ungroup()

# Check for missing hours and fill gaps by carrying the last observation forward
weather_complete_q1 <- weather_processed_q1 %>%
  complete(interval60 = seq(min(interval60), max(interval60), by = "hour")) %>%
  fill(Temperature, Precipitation, Wind_Speed, .direction = "down")

# Look at the weather
summary(weather_complete_q1 %>% select(Temperature, Precipitation, Wind_Speed))
  Temperature    Precipitation        Wind_Speed    
 Min.   :10.00   Min.   :0.000000   Min.   : 0.000  
 1st Qu.:30.00   1st Qu.:0.000000   1st Qu.: 5.625  
 Median :37.00   Median :0.000000   Median : 8.000  
 Mean   :38.35   Mean   :0.006774   Mean   : 9.329  
 3rd Qu.:46.00   3rd Qu.:0.000000   3rd Qu.:13.000  
 Max.   :78.00   Max.   :1.240100   Max.   :30.000  

5.2 Visualize Weather Patterns

Code
weather_q3_plot <- ggplot(weather_complete_q3, aes(x = interval60, y = Temperature)) +
  geom_line(color = "#3182bd", alpha = 0.7) +
  geom_smooth(se = FALSE, color = "red") +
  labs(
    title = "Philadelphia Temperature - 2024 Q3",
    subtitle = "Summer to early autumn transition",
    x = "Date",
    y = "Temperature (°F)"
  ) +
  plotTheme

weather_q1_plot <- ggplot(weather_complete_q1, aes(x = interval60, y = Temperature)) +
  geom_line(color = "#3182bd", alpha = 0.7) +
  geom_smooth(se = FALSE, color = "red") +
  labs(
    title = "Philadelphia Temperature - 2025 Q1",
    subtitle = "Winter to early spring transition",
    x = "Date",
    y = "Temperature (°F)"
  ) +
  plotTheme

weather_q3_plot / weather_q1_plot

Based on the temperature plots, the two quarters exhibit distinct thermal profiles:

  • Summer (Q3 2024) temperatures are consistently high and stable, fluctuating regularly between 70°F and 90°F with a slight cooling trend as autumn approaches. Winter (Q1 2025) shows a clear upward warming trend, starting with freezing lows (dipping near 20°F) in January and rising to milder spring temperatures (peaks near 70°F) by late March.

  • The winter plot reveals sharper volatility, including a notable cold snap in late January where temperatures plummeted significantly, whereas summer follows a more predictable daily cycle.


6. Create Complete Space-Time Panel

  • This code constructs the complete space-time panel dataset required for modeling. It first aggregates raw trip data into hourly counts and joins this to a full grid of all possible station-hour combinations, ensuring that periods with no ridership are explicitly recorded as zeros rather than missing values. The dataset is then enriched with temporal features (e.g., rush hour flags), weather data, and census demographics, before calculating lag variables (demand from 1, 3, and 24 hours ago) to allow the model to use recent history as a predictor.
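
The grid-expand / zero-fill / lag logic described above can be sketched on a toy example (two stations, four hours; all names and numbers are illustrative):

```r
library(dplyr)
library(tidyr)

# Observed trips: station A has no trips in hours 2 and 4, B none in 3 and 4
trips <- tibble(
  hour    = c(1, 3, 1, 2),
  station = c("A", "A", "B", "B"),
  n       = c(2, 1, 4, 3)
)

panel <- expand_grid(hour = 1:4, station = c("A", "B")) %>%
  left_join(trips, by = c("hour", "station")) %>%
  mutate(n = replace_na(n, 0)) %>%   # missing station-hours become zeros
  arrange(station, hour) %>%
  group_by(station) %>%
  mutate(lag1 = lag(n)) %>%          # demand one hour earlier, per station
  ungroup()
```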

6.1 2024 Q3 Complete Panel

Code
# 1. Aggregate Trips to Station-Hour Level
# Count trips by station-hour
trips_panel_q3 <- indego_census_q3 %>%
  group_by(interval60, start_station, start_lat, start_lon) %>%
  summarize(Trip_Count = n(), .groups = "drop")
Code
# 2. Create Complete Panel Structure
# Calculate expected panel size
# Note: Q3 2024 (July, Aug, Sept) has 92 days. 92 * 24 = 2208 hours.
n_stations <- length(unique(trips_panel_q3$start_station))

# Manually set the Q3 time range to ensure we capture ALL hours, even if no trips happened
q3_start <- ymd_hms("2024-07-01 00:00:00")
q3_end   <- ymd_hms("2024-09-30 23:00:00")
full_time_seq <- seq(from = q3_start, to = q3_end, by = "1 hour")

n_hours <- length(full_time_seq)
expected_rows_q3 <- n_stations * n_hours

cat("Expected panel rows:", format(expected_rows_q3, big.mark = ","), "\n")
Expected panel rows: 532,128 
Code
cat("Current rows in trips_panel:", format(nrow(trips_panel_q3), big.mark = ","), "\n")
Current rows in trips_panel: 193,072 
Code
# Create complete panel
study_panel_q3 <- expand.grid(
  interval60 = full_time_seq,               # Use the full explicit time sequence
  start_station = unique(trips_panel_q3$start_station)
) %>%
  # Join trip counts (this will generate NAs for hours with no trips)
  left_join(trips_panel_q3, by = c("interval60", "start_station")) %>%
  # Replace NA trip counts with 0 (Crucial step!)
  mutate(Trip_Count = replace_na(Trip_Count, 0))

# Join attributes back to the main expanded panel
# Note: We select specific columns from stations_census_q3 to avoid duplicating attribute columns in the join
study_panel_q3 <- study_panel_q3 %>%
  left_join(
    stations_census_q3 %>% 
      select(start_station, Med_Inc, Percent_Taking_Transit, Percent_White, Total_Pop),
    by = "start_station"
  )

# Verify we have complete panel
cat("Complete panel rows Q3:", format(nrow(study_panel_q3), big.mark = ","), "\n")
Complete panel rows Q3: 532,128 
Code
# 3. Add Time Features
study_panel_q3 <- study_panel_q3 %>%
  mutate(
    week = week(interval60),
    month = month(interval60, label = TRUE),
    dotw = wday(interval60, label = TRUE),
    hour = hour(interval60),
    date = as.Date(interval60),
    weekend = ifelse(dotw %in% c("Sat", "Sun"), 1, 0),
    rush_hour = ifelse(hour %in% c(7, 8, 9, 16, 17, 18), 1, 0)
  )
Code
# 4. Join Weather Data
study_panel_q3 <- study_panel_q3 %>%
  left_join(weather_complete_q3, by = "interval60")

# Check for missing values
summary(study_panel_q3 %>% select(Trip_Count, Temperature, Precipitation))
   Trip_Count      Temperature    Precipitation   
 Min.   : 0.000   Min.   :55.00   Min.   :0.0000  
 1st Qu.: 0.000   1st Qu.:70.00   1st Qu.:0.0000  
 Median : 0.000   Median :76.00   Median :0.0000  
 Mean   : 0.722   Mean   :75.96   Mean   :0.0098  
 3rd Qu.: 1.000   3rd Qu.:81.00   3rd Qu.:0.0000  
 Max.   :24.000   Max.   :98.00   Max.   :2.7501  
                  NA's   :5784    NA's   :5784    

Code
# 5. Create Temporal Lag Variables
# Sort by station and time
study_panel_q3 <- study_panel_q3 %>%
  arrange(start_station, interval60)

# Create lag variables WITHIN each station
study_panel_q3 <- study_panel_q3 %>%
  group_by(start_station) %>%
  mutate(
    lag1Hour = lag(Trip_Count, 1),
    lag2Hours = lag(Trip_Count, 2),
    lag3Hours = lag(Trip_Count, 3),
    lag12Hours = lag(Trip_Count, 12),
    lag1day = lag(Trip_Count, 24)
  ) %>%
  ungroup()

# Remove rows with NA lags (first 24 hours for each station)
study_panel_complete_q3 <- study_panel_q3 %>%
  filter(!is.na(lag1day))

cat("Final 2024 Q3 Panel:", format(nrow(study_panel_complete_q3), big.mark = ","), "\n")
Final 2024 Q3 Panel: 526,344 

6.2 2025 Q1 Complete Panel

Code
# 1. Aggregate Trips to Station-Hour Level
# Count trips by station-hour
trips_panel_q1 <- indego_census_q1 %>%
  group_by(interval60, start_station, start_lat, start_lon) %>%
  summarize(Trip_Count = n(), .groups = "drop")
Code
# 2. Create Complete Panel Structure
# Calculate expected panel size
n_stations <- length(unique(trips_panel_q1$start_station))

# Manually set the Q1 2025 time range to ensure we capture ALL hours
# Q1 is Jan 1 to Mar 31
q1_start <- ymd_hms("2025-01-01 00:00:00")
q1_end   <- ymd_hms("2025-03-31 23:00:00")

# Force generate a complete hourly sequence
full_time_seq_q1 <- seq(from = q1_start, to = q1_end, by = "1 hour")

n_hours <- length(full_time_seq_q1)
expected_rows_q1 <- n_stations * n_hours

cat("Expected panel rows Q1:", format(expected_rows_q1, big.mark = ","), "\n")
Expected panel rows Q1: 529,200 
Code
cat("Current rows in trips_panel:", format(nrow(trips_panel_q1), big.mark = ","), "\n")
Current rows in trips_panel: 116,718 
Code
# Create complete panel
study_panel_q1 <- expand.grid(
  interval60 = full_time_seq_q1,            # Use the full explicit time sequence
  start_station = unique(trips_panel_q1$start_station)
) %>%
  # Join trip counts
  left_join(trips_panel_q1, by = c("interval60", "start_station")) %>%
  # Replace NA trip counts with 0 (Crucial step!)
  mutate(Trip_Count = replace_na(Trip_Count, 0))

# Join attributes back to the main expanded panel
# Use select to ensure we don't duplicate columns
study_panel_q1 <- study_panel_q1 %>%
  left_join(
    stations_census_q1 %>% 
      select(start_station, Med_Inc, Percent_Taking_Transit, Percent_White, Total_Pop),
    by = "start_station"
  )

# Verify we have complete panel
cat("Complete panel rows Q1:", format(nrow(study_panel_q1), big.mark = ","), "\n")
Complete panel rows Q1: 529,200 
Code
# 3. Add Time Features
study_panel_q1 <- study_panel_q1 %>%
  mutate(
    week = week(interval60),
    month = month(interval60, label = TRUE),
    dotw = wday(interval60, label = TRUE),
    hour = hour(interval60),
    date = as.Date(interval60),
    weekend = ifelse(dotw %in% c("Sat", "Sun"), 1, 0),
    rush_hour = ifelse(hour %in% c(7, 8, 9, 16, 17, 18), 1, 0)
  )
Code
# 4. Join Weather Data
study_panel_q1 <- study_panel_q1 %>%
  left_join(weather_complete_q1, by = "interval60")

# Check for missing values
summary(study_panel_q1 %>% select(Trip_Count, Temperature, Precipitation))
   Trip_Count       Temperature    Precipitation   
 Min.   : 0.0000   Min.   :10.00   Min.   :0.0000  
 1st Qu.: 0.0000   1st Qu.:30.00   1st Qu.:0.0000  
 Median : 0.0000   Median :37.00   Median :0.0000  
 Mean   : 0.3569   Mean   :38.35   Mean   :0.0068  
 3rd Qu.: 0.0000   3rd Qu.:46.00   3rd Qu.:0.0000  
 Max.   :26.0000   Max.   :78.00   Max.   :1.2401  
                   NA's   :5880    NA's   :5880    

Code
# 5. Create Temporal Lag Variables
# Sort by station and time
study_panel_q1 <- study_panel_q1 %>%
  arrange(start_station, interval60)

# Create lag variables WITHIN each station
study_panel_q1 <- study_panel_q1 %>%
  group_by(start_station) %>%
  mutate(
    lag1Hour = lag(Trip_Count, 1),
    lag2Hours = lag(Trip_Count, 2),
    lag3Hours = lag(Trip_Count, 3),
    lag12Hours = lag(Trip_Count, 12),
    lag1day = lag(Trip_Count, 24)
  ) %>%
  ungroup()

# Remove rows with NA lags (first 24 hours for each station)
study_panel_complete_q1 <- study_panel_q1 %>%
  filter(!is.na(lag1day))

cat("Final 2025 Q1 Panel:", format(nrow(study_panel_complete_q1), big.mark = ","), "\n")
Final 2025 Q1 Panel: 523,320 

7. Visualize Temporal Patterns

7.1 Lag Correlations

Code
# Sample one station to visualize
example_station <- study_panel_complete_q3 %>%
  filter(start_station == first(start_station)) %>%
  head(168)  # One week

# Plot actual vs lagged demand
ggplot(example_station, aes(x = interval60)) +
  geom_line(aes(y = Trip_Count, color = "Current"), linewidth = 1) +
  geom_line(aes(y = lag1Hour, color = "1 Hour Ago"), linewidth = 1, alpha = 0.7) +
  geom_line(aes(y = lag1day, color = "24 Hours Ago"), linewidth = 1, alpha = 0.7) +
  scale_color_manual(values = c(
    "Current" = "#08519c",
    "1 Hour Ago" = "#3182bd",
    "24 Hours Ago" = "#6baed6"
  )) +
  labs(
    title = "Temporal Lag Patterns at One Station",
    subtitle = "Past demand predicts future demand",
    x = "Date-Time",
    y = "Trip Count",
    color = "Time Period"
  ) +
  plotTheme

  • This visualization confirms the strong temporal autocorrelation inherent in bike share data. Overlaying current trip counts (dark blue) with demand from one hour and one day earlier reveals a distinct, repetitive daily rhythm. The close alignment between the “Current” and “24 Hours Ago” lines shows that demand is highly cyclical, with peaks recurring at similar times each day. This supports the decision to engineer lag features: historical usage, particularly the previous day’s pattern, is a powerful predictor of future demand.
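The visual impression can be quantified by correlating current counts with each engineered lag (a sketch using the Q3 panel; `use = "complete.obs"` guards against any residual NAs):

Code
library(dplyr)
library(tidyr)

# Correlation of current demand with each lag; values nearer 1
# indicate stronger temporal autocorrelation at that horizon
study_panel_complete_q3 %>%
  summarize(across(
    c(lag1Hour, lag2Hours, lag3Hours, lag12Hours, lag1day),
    ~ cor(Trip_Count, .x, use = "complete.obs")
  )) %>%
  pivot_longer(everything(), names_to = "Lag", values_to = "Correlation")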

8. Temporal Train/Test Split

8.1 2024 Q3

Code
# Q3 has weeks 27-40 (Jul-Sep)
# Train on weeks 27-35 (Jul 1 - early September)
# Test on weeks 36-40 (rest of September)
unique(study_panel_complete_q3$week)
 [1] 27 28 29 30 31 32 33 34 35 36 37 38 39 40
Code
# Which stations have trips in BOTH early and late periods?
early_stations_q3 <- study_panel_complete_q3 %>%
  filter(week < 36) %>%
  filter(Trip_Count > 0) %>%
  distinct(start_station) %>%
  pull(start_station)

late_stations_q3 <- study_panel_complete_q3 %>%
  filter(week >= 36) %>%
  filter(Trip_Count > 0) %>%
  distinct(start_station) %>%
  pull(start_station)

# Keep only stations that appear in BOTH periods
common_stations_q3 <- intersect(early_stations_q3, late_stations_q3)

# Filter panel to only common stations
study_panel_complete_q3 <- study_panel_complete_q3 %>%
  filter(start_station %in% common_stations_q3)

# NOW create train/test split
train_q3 <- study_panel_complete_q3 %>%
  filter(week < 36)

test_q3 <- study_panel_complete_q3 %>%
  filter(week >= 36)

cat("Training observations Q3:", format(nrow(train_q3), big.mark = ","), "\n")
Training observations Q3: 349,680 
Code
cat("Testing observations Q3:", format(nrow(test_q3), big.mark = ","), "\n")
Testing observations Q3: 163,560 

8.2 2025 Q1

Code
# Train on weeks 1-9 (Jan 1 - early March)
# Test on weeks 10-13 (rest of March)
unique(study_panel_complete_q1$week)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13
Code
# Which stations have trips in BOTH early and late periods?
early_stations_q1 <- study_panel_complete_q1 %>%
  filter(week < 10) %>%
  filter(Trip_Count > 0) %>%
  distinct(start_station) %>%
  pull(start_station)

late_stations_q1 <- study_panel_complete_q1 %>%
  filter(week >= 10) %>%
  filter(Trip_Count > 0) %>%
  distinct(start_station) %>%
  pull(start_station)

# Keep only stations that appear in BOTH periods
common_stations_q1 <- intersect(early_stations_q1, late_stations_q1)

# Filter panel to only common stations
study_panel_complete_q1 <- study_panel_complete_q1 %>%
  filter(start_station %in% common_stations_q1)

# NOW create train/test split
train_q1 <- study_panel_complete_q1 %>%
  filter(week < 10)

test_q1 <- study_panel_complete_q1 %>%
  filter(week >= 10)

cat("Training observations Q1:", format(nrow(train_q1), big.mark = ","), "\n")
Training observations Q1: 354,144 
Code
cat("Testing observations Q1:", format(nrow(test_q1), big.mark = ","), "\n")
Testing observations Q1: 154,224 
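The Q3 and Q1 splits repeat the same logic; a hedged refactor (the function name `split_panel` is ours) wraps it once so the cutoff week is the only thing that changes:

Code
library(dplyr)

# Split a panel at a cutoff week, keeping only stations with at least
# one trip in BOTH the training and testing periods.
split_panel <- function(panel, cutoff_week) {
  active <- function(df) df %>%
    filter(Trip_Count > 0) %>%
    distinct(start_station) %>%
    pull(start_station)

  common <- intersect(
    active(filter(panel, week <  cutoff_week)),
    active(filter(panel, week >= cutoff_week))
  )

  panel <- filter(panel, start_station %in% common)
  list(
    train = filter(panel, week <  cutoff_week),
    test  = filter(panel, week >= cutoff_week)
  )
}

# e.g. splits_q1 <- split_panel(study_panel_complete_q1, cutoff_week = 10)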

9. Build Predictive Models

9.1 2024 Q3 Models

Model 1: Baseline (Time + Weather)

Code
# Create day of week factor with treatment (dummy) coding
train_q3 <- train_q3 %>%
  mutate(dotw_simple = factor(dotw, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))

# Set contrasts to treatment coding (dummy variables)
contrasts(train_q3$dotw_simple) <- contr.treatment(7)

# Now run the model
model1_q3 <- lm(
  Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation,
  data = train_q3
)
summary(model1_q3)

Call:
lm(formula = Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + 
    Precipitation, data = train_q3)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.7222 -0.7872 -0.1954  0.1871 23.4121 

Coefficients:
                    Estimate Std. Error t value             Pr(>|t|)    
(Intercept)        0.3940977  0.0335506  11.746 < 0.0000000000000002 ***
as.factor(hour)1  -0.0977139  0.0144193  -6.777      0.0000000000123 ***
as.factor(hour)2  -0.1340615  0.0144109  -9.303 < 0.0000000000000002 ***
as.factor(hour)3  -0.1867155  0.0144452 -12.926 < 0.0000000000000002 ***
as.factor(hour)4  -0.1748870  0.0144752 -12.082 < 0.0000000000000002 ***
as.factor(hour)5  -0.0481245  0.0145117  -3.316             0.000912 ***
as.factor(hour)6   0.2138134  0.0145495  14.696 < 0.0000000000000002 ***
as.factor(hour)7   0.4927407  0.0145725  33.813 < 0.0000000000000002 ***
as.factor(hour)8   0.8922681  0.0146149  61.052 < 0.0000000000000002 ***
as.factor(hour)9   0.6469687  0.0146375  44.199 < 0.0000000000000002 ***
as.factor(hour)10  0.5327485  0.0145974  36.496 < 0.0000000000000002 ***
as.factor(hour)11  0.5624303  0.0145101  38.761 < 0.0000000000000002 ***
as.factor(hour)12  0.6732015  0.0144185  46.690 < 0.0000000000000002 ***
as.factor(hour)13  0.6935212  0.0143758  48.242 < 0.0000000000000002 ***
as.factor(hour)14  0.7445912  0.0143807  51.777 < 0.0000000000000002 ***
as.factor(hour)15  0.8279830  0.0144296  57.381 < 0.0000000000000002 ***
as.factor(hour)16  1.0219626  0.0144734  70.610 < 0.0000000000000002 ***
as.factor(hour)17  1.3858030  0.0145355  95.339 < 0.0000000000000002 ***
as.factor(hour)18  1.1083174  0.0145778  76.028 < 0.0000000000000002 ***
as.factor(hour)19  0.8919309  0.0145886  61.139 < 0.0000000000000002 ***
as.factor(hour)20  0.6268454  0.0145857  42.977 < 0.0000000000000002 ***
as.factor(hour)21  0.4128159  0.0145127  28.445 < 0.0000000000000002 ***
as.factor(hour)22  0.3157927  0.0144556  21.846 < 0.0000000000000002 ***
as.factor(hour)23  0.1548512  0.0143958  10.757 < 0.0000000000000002 ***
dotw_simple2       0.0735856  0.0079434   9.264 < 0.0000000000000002 ***
dotw_simple3       0.0751295  0.0079782   9.417 < 0.0000000000000002 ***
dotw_simple4       0.0468697  0.0079654   5.884      0.0000000040045 ***
dotw_simple5      -0.0473784  0.0079563  -5.955      0.0000000026056 ***
dotw_simple6      -0.0653380  0.0079484  -8.220 < 0.0000000000000002 ***
dotw_simple7      -0.1227242  0.0079501 -15.437 < 0.0000000000000002 ***
Temperature       -0.0018760  0.0003914  -4.793      0.0000016451977 ***
Precipitation     -0.0535064  0.0174455  -3.067             0.002162 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.227 on 349648 degrees of freedom
Multiple R-squared:  0.1116,    Adjusted R-squared:  0.1115 
F-statistic:  1417 on 31 and 349648 DF,  p-value: < 0.00000000000000022

Model 2: Add Temporal Lags

Code
model2_q3 <- lm(
  Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
    lag1Hour + lag3Hours + lag1day,
  data = train_q3
)

summary(model2_q3)

Call:
lm(formula = Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + 
    Precipitation + lag1Hour + lag3Hours + lag1day, data = train_q3)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.2148 -0.5344 -0.1526  0.2223 21.4940 

Coefficients:
                    Estimate Std. Error t value             Pr(>|t|)    
(Intercept)        0.1160760  0.0296289   3.918  0.00008943093627569 ***
as.factor(hour)1  -0.0274986  0.0127286  -2.160             0.030744 *  
as.factor(hour)2  -0.0081186  0.0127275  -0.638             0.523554    
as.factor(hour)3  -0.0206389  0.0127662  -1.617             0.105947    
as.factor(hour)4   0.0099397  0.0128002   0.777             0.437438    
as.factor(hour)5   0.1009008  0.0128356   7.861  0.00000000000000382 ***
as.factor(hour)6   0.2677691  0.0128830  20.785 < 0.0000000000000002 ***
as.factor(hour)7   0.4049385  0.0129270  31.325 < 0.0000000000000002 ***
as.factor(hour)8   0.6122588  0.0130130  47.050 < 0.0000000000000002 ***
as.factor(hour)9   0.3103126  0.0130032  23.864 < 0.0000000000000002 ***
as.factor(hour)10  0.2548051  0.0129193  19.723 < 0.0000000000000002 ***
as.factor(hour)11  0.2592292  0.0128474  20.178 < 0.0000000000000002 ***
as.factor(hour)12  0.3607666  0.0127686  28.254 < 0.0000000000000002 ***
as.factor(hour)13  0.3604016  0.0127400  28.289 < 0.0000000000000002 ***
as.factor(hour)14  0.3901808  0.0127504  30.601 < 0.0000000000000002 ***
as.factor(hour)15  0.4246068  0.0128060  33.157 < 0.0000000000000002 ***
as.factor(hour)16  0.5440485  0.0128779  42.247 < 0.0000000000000002 ***
as.factor(hour)17  0.7549667  0.0130153  58.006 < 0.0000000000000002 ***
as.factor(hour)18  0.4587761  0.0130465  35.165 < 0.0000000000000002 ***
as.factor(hour)19  0.3460763  0.0129926  26.636 < 0.0000000000000002 ***
as.factor(hour)20  0.1607276  0.0129834  12.379 < 0.0000000000000002 ***
as.factor(hour)21  0.1050185  0.0128607   8.166  0.00000000000000032 ***
as.factor(hour)22  0.1110356  0.0127829   8.686 < 0.0000000000000002 ***
as.factor(hour)23  0.0497331  0.0127111   3.913  0.00009134129697993 ***
dotw_simple2       0.0119189  0.0070139   1.699             0.089260 .  
dotw_simple3      -0.0048842  0.0070478  -0.693             0.488309    
dotw_simple4      -0.0256755  0.0070366  -3.649             0.000263 ***
dotw_simple5      -0.0795199  0.0070293 -11.313 < 0.0000000000000002 ***
dotw_simple6      -0.0651468  0.0070181  -9.283 < 0.0000000000000002 ***
dotw_simple7      -0.1012308  0.0070217 -14.417 < 0.0000000000000002 ***
Temperature       -0.0008822  0.0003455  -2.553             0.010674 *  
Precipitation      0.0419172  0.0154082   2.720             0.006520 ** 
lag1Hour           0.2395396  0.0016301 146.948 < 0.0000000000000002 ***
lag3Hours          0.1140298  0.0015873  71.839 < 0.0000000000000002 ***
lag1day            0.2736878  0.0016027 170.763 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.083 on 349645 degrees of freedom
Multiple R-squared:  0.308, Adjusted R-squared:  0.3079 
F-statistic:  4576 on 34 and 349645 DF,  p-value: < 0.00000000000000022

Question: Did adding lags improve R²? Yes, substantially: R² nearly tripled, from 0.112 to 0.308. Hourly demand is strongly autocorrelated, so the previous hour's and previous day's counts capture signal that time-of-day, day-of-week, and weather terms alone cannot.

Model 3: Add Demographics

Code
model3_q3 <- lm(
  Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
    lag1Hour + lag3Hours + lag1day +
    Med_Inc + Percent_Taking_Transit + Percent_White,
  data = train_q3
)

# Full summary omitted for brevity; show key fit metrics
cat("Model 3 R-squared:", summary(model3_q3)$r.squared, "\n")
Model 3 R-squared: 0.3142151 
Code
cat("Model 3 Adj R-squared:", summary(model3_q3)$adj.r.squared, "\n")
Model 3 Adj R-squared: 0.3141425 

Model 4: Add Station Fixed Effects

Code
model4_q3 <- lm(
  Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
    lag1Hour + lag3Hours + lag1day +
    Med_Inc + Percent_Taking_Transit + Percent_White +
    as.factor(start_station),
  data = train_q3
)

# Summary too long with all station dummies, just show key metrics
cat("Model 4 R-squared:", summary(model4_q3)$r.squared, "\n")
Model 4 R-squared: 0.3434496 
Code
cat("Model 4 Adj R-squared:", summary(model4_q3)$adj.r.squared, "\n")
Model 4 Adj R-squared: 0.342946 

Model 5: Add Rush Hour Interaction

Code
model5_q3 <- lm(
  Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
    lag1Hour + lag3Hours + lag1day + rush_hour + as.factor(month) +
    Med_Inc + Percent_Taking_Transit + Percent_White +
    as.factor(start_station) +
    rush_hour * weekend,  # Rush hour effects different on weekends
  data = train_q3
)

cat("Model 5 R-squared:", summary(model5_q3)$r.squared, "\n")
Model 5 R-squared: 0.3479645 
Code
cat("Model 5 Adj R-squared:", summary(model5_q3)$adj.r.squared, "\n")
Model 5 Adj R-squared: 0.3474588 

Model performance summary (training set)

Code
model_rsq_q3 <- data.frame(
  Model = paste0("Model ", 1:5),
  Description = c("Time + Weather", "+ Temporal Lags", "+ Demographics", 
                  "+ Station Fixed Effects", "+ Rush Hour Interaction"),
  R_squared = c(
    summary(model1_q3)$r.squared,
    summary(model2_q3)$r.squared,
    summary(model3_q3)$r.squared,
    summary(model4_q3)$r.squared,
    summary(model5_q3)$r.squared
  )
)

kable(model_rsq_q3,
      caption = "2024 Q3 Model Performance",
      col.names = c("Model", "Description", "R²"),
      digits = 3) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
2024 Q3 Model Performance
Model   Description               R²
Model 1 Time + Weather 0.112
Model 2 + Temporal Lags 0.308
Model 3 + Demographics 0.314
Model 4 + Station Fixed Effects 0.343
Model 5 + Rush Hour Interaction 0.348

Model Evaluation Summary (test set)

Code
# Create day of week factor with treatment (dummy) coding
test_q3 <- test_q3 %>%
  mutate(dotw_simple = factor(dotw, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))

# Set contrasts to treatment coding (dummy variables)
contrasts(test_q3$dotw_simple) <- contr.treatment(7)

test_q3 <- test_q3 %>%
  mutate(
    pred1 = predict(model1_q3, newdata = test_q3),
    pred2 = predict(model2_q3, newdata = test_q3),
    pred3 = predict(model3_q3, newdata = test_q3),
    pred4 = predict(model4_q3, newdata = test_q3),
    pred5 = predict(model5_q3, newdata = test_q3)
  )

# Calculate MAE for each model
mae_results_q3 <- data.frame(
  Model = paste0("Model ", 1:5),
  Description = c("Time + Weather", "+ Temporal Lags", "+ Demographics", 
                  "+ Station Fixed Effects", "+ Rush Hour Interaction"),
  MAE_q3 = c(
    mean(abs(test_q3$Trip_Count - test_q3$pred1), na.rm = TRUE),
    mean(abs(test_q3$Trip_Count - test_q3$pred2), na.rm = TRUE),
    mean(abs(test_q3$Trip_Count - test_q3$pred3), na.rm = TRUE),
    mean(abs(test_q3$Trip_Count - test_q3$pred4), na.rm = TRUE),
    mean(abs(test_q3$Trip_Count - test_q3$pred5), na.rm = TRUE)
  )
)

kable(mae_results_q3, 
      digits = 2,
      caption = "2024 Q3 Model Evaluation",
      col.names = c("Model","Description", "MAE")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
2024 Q3 Model Evaluation
Model Description MAE
Model 1 Time + Weather 0.83
Model 2 + Temporal Lags 0.71
Model 3 + Demographics 0.72
Model 4 + Station Fixed Effects 0.71
Model 5 + Rush Hour Interaction 0.71
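The five repeated `mean(abs(...))` calls can be collapsed with a small helper plus `sapply()` (a sketch; `mae()` is our own name, applied over the `pred1`–`pred5` columns created above):

Code
# MAE helper: mean absolute deviation between observed and predicted counts
mae <- function(actual, pred) mean(abs(actual - pred), na.rm = TRUE)

mae_results_q3 <- data.frame(
  Model = paste0("Model ", 1:5),
  MAE_q3 = sapply(paste0("pred", 1:5),
                  function(p) mae(test_q3$Trip_Count, test_q3[[p]]))
)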

9.2 2025 Q1 Models

Model 1: Baseline (Time + Weather)

Code
# Create day of week factor with treatment (dummy) coding
train_q1 <- train_q1 %>%
  mutate(dotw_simple = factor(dotw, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))

# Set contrasts to treatment coding (dummy variables)
contrasts(train_q1$dotw_simple) <- contr.treatment(7)

# Now run the model
model1_q1 <- lm(
  Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation,
  data = train_q1
)
summary(model1_q1)

Call:
lm(formula = Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + 
    Precipitation, data = train_q1)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.9571 -0.3981 -0.1761  0.0246 18.5345 

Coefficients:
                   Estimate Std. Error t value             Pr(>|t|)    
(Intercept)       -0.186083   0.008656 -21.497 < 0.0000000000000002 ***
as.factor(hour)1  -0.022170   0.008662  -2.560              0.01048 *  
as.factor(hour)2  -0.038425   0.008665  -4.435 0.000009223337239405 ***
as.factor(hour)3  -0.051750   0.008669  -5.969 0.000000002382143657 ***
as.factor(hour)4  -0.040250   0.008675  -4.640 0.000003491064081685 ***
as.factor(hour)5   0.022443   0.008680   2.586              0.00972 ** 
as.factor(hour)6   0.173092   0.008688  19.924 < 0.0000000000000002 ***
as.factor(hour)7   0.323218   0.008691  37.190 < 0.0000000000000002 ***
as.factor(hour)8   0.549242   0.008694  63.175 < 0.0000000000000002 ***
as.factor(hour)9   0.388029   0.008696  44.622 < 0.0000000000000002 ***
as.factor(hour)10  0.279833   0.008700  32.164 < 0.0000000000000002 ***
as.factor(hour)11  0.297059   0.008703  34.134 < 0.0000000000000002 ***
as.factor(hour)12  0.367753   0.008691  42.314 < 0.0000000000000002 ***
as.factor(hour)13  0.347729   0.008672  40.100 < 0.0000000000000002 ***
as.factor(hour)14  0.351219   0.008662  40.548 < 0.0000000000000002 ***
as.factor(hour)15  0.406504   0.008664  46.917 < 0.0000000000000002 ***
as.factor(hour)16  0.488703   0.008668  56.379 < 0.0000000000000002 ***
as.factor(hour)17  0.620476   0.008676  71.520 < 0.0000000000000002 ***
as.factor(hour)18  0.423784   0.008687  48.786 < 0.0000000000000002 ***
as.factor(hour)19  0.270369   0.008693  31.104 < 0.0000000000000002 ***
as.factor(hour)20  0.153548   0.008685  17.679 < 0.0000000000000002 ***
as.factor(hour)21  0.090346   0.008678  10.411 < 0.0000000000000002 ***
as.factor(hour)22  0.069686   0.008668   8.039 0.000000000000000905 ***
as.factor(hour)23  0.045265   0.008661   5.227 0.000000172769575175 ***
dotw_simple2       0.053398   0.004640  11.509 < 0.0000000000000002 ***
dotw_simple3       0.059407   0.004803  12.370 < 0.0000000000000002 ***
dotw_simple4       0.033142   0.004647   7.131 0.000000000000995308 ***
dotw_simple5       0.044776   0.004642   9.646 < 0.0000000000000002 ***
dotw_simple6      -0.063711   0.004641 -13.728 < 0.0000000000000002 ***
dotw_simple7      -0.059444   0.004669 -12.731 < 0.0000000000000002 ***
Temperature        0.007989   0.000153  52.200 < 0.0000000000000002 ***
Precipitation     -1.125949   0.058754 -19.164 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7438 on 354112 degrees of freedom
Multiple R-squared:  0.0801,    Adjusted R-squared:  0.08002 
F-statistic: 994.6 on 31 and 354112 DF,  p-value: < 0.00000000000000022

Model 2: Add Temporal Lags

Code
model2_q1 <- lm(
  Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
    lag1Hour + lag3Hours + lag1day,
  data = train_q1
)

summary(model2_q1)

Call:
lm(formula = Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + 
    Precipitation + lag1Hour + lag3Hours + lag1day, data = train_q1)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.9320 -0.2812 -0.0997  0.0182 17.8035 

Coefficients:
                    Estimate Std. Error t value             Pr(>|t|)    
(Intercept)       -0.0962851  0.0079137 -12.167 < 0.0000000000000002 ***
as.factor(hour)1  -0.0062075  0.0079108  -0.785              0.43264    
as.factor(hour)2  -0.0076621  0.0079144  -0.968              0.33299    
as.factor(hour)3  -0.0102415  0.0079194  -1.293              0.19594    
as.factor(hour)4   0.0051558  0.0079260   0.650              0.51538    
as.factor(hour)5   0.0524781  0.0079312   6.617      0.0000000000368 ***
as.factor(hour)6   0.1554233  0.0079441  19.565 < 0.0000000000000002 ***
as.factor(hour)7   0.2350093  0.0079592  29.527 < 0.0000000000000002 ***
as.factor(hour)8   0.3679467  0.0079919  46.040 < 0.0000000000000002 ***
as.factor(hour)9   0.1779574  0.0079899  22.273 < 0.0000000000000002 ***
as.factor(hour)10  0.1162892  0.0079704  14.590 < 0.0000000000000002 ***
as.factor(hour)11  0.1318394  0.0079848  16.511 < 0.0000000000000002 ***
as.factor(hour)12  0.2001697  0.0079651  25.131 < 0.0000000000000002 ***
as.factor(hour)13  0.1834693  0.0079446  23.094 < 0.0000000000000002 ***
as.factor(hour)14  0.1894740  0.0079350  23.878 < 0.0000000000000002 ***
as.factor(hour)15  0.2256510  0.0079437  28.406 < 0.0000000000000002 ***
as.factor(hour)16  0.2748485  0.0079597  34.530 < 0.0000000000000002 ***
as.factor(hour)17  0.3568287  0.0079903  44.658 < 0.0000000000000002 ***
as.factor(hour)18  0.1712821  0.0079927  21.430 < 0.0000000000000002 ***
as.factor(hour)19  0.0884182  0.0079724  11.091 < 0.0000000000000002 ***
as.factor(hour)20  0.0183403  0.0079699   2.301              0.02138 *  
as.factor(hour)21  0.0144891  0.0079401   1.825              0.06803 .  
as.factor(hour)22  0.0252975  0.0079210   3.194              0.00140 ** 
as.factor(hour)23  0.0216694  0.0079105   2.739              0.00616 ** 
dotw_simple2       0.0204023  0.0042393   4.813      0.0000014898921 ***
dotw_simple3       0.0109642  0.0043909   2.497              0.01252 *  
dotw_simple4      -0.0017846  0.0042478  -0.420              0.67440    
dotw_simple5       0.0119251  0.0042419   2.811              0.00494 ** 
dotw_simple6      -0.0718362  0.0042473 -16.914 < 0.0000000000000002 ***
dotw_simple7      -0.0412702  0.0042671  -9.672 < 0.0000000000000002 ***
Temperature        0.0036650  0.0001409  26.019 < 0.0000000000000002 ***
Precipitation     -0.6138873  0.0537180 -11.428 < 0.0000000000000002 ***
lag1Hour           0.2278355  0.0016252 140.191 < 0.0000000000000002 ***
lag3Hours          0.0968586  0.0015982  60.606 < 0.0000000000000002 ***
lag1day            0.2413146  0.0016078 150.093 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6793 on 354109 degrees of freedom
Multiple R-squared:  0.2327,    Adjusted R-squared:  0.2326 
F-statistic:  3159 on 34 and 354109 DF,  p-value: < 0.00000000000000022

Question: Did adding lags improve R²? Yes: R² rose from 0.080 to 0.233. The proportional lift mirrors Q3's, though overall fit is lower, consistent with winter demand being sparser and noisier.

Model 3: Add Demographics

Code
model3_q1 <- lm(
  Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
    lag1Hour + lag3Hours + lag1day +
    Med_Inc + Percent_Taking_Transit + Percent_White,
  data = train_q1
)

# Full summary omitted for brevity; show key fit metrics
cat("Model 3 R-squared:", summary(model3_q1)$r.squared, "\n")
Model 3 R-squared: 0.2371779 
Code
cat("Model 3 Adj R-squared:", summary(model3_q1)$adj.r.squared, "\n")
Model 3 Adj R-squared: 0.2370982 

Model 4: Add Station Fixed Effects

Code
model4_q1 <- lm(
  Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
    lag1Hour + lag3Hours + lag1day +
    Med_Inc + Percent_Taking_Transit + Percent_White +
    as.factor(start_station),
  data = train_q1
)

# Summary too long with all station dummies, just show key metrics
cat("Model 4 R-squared:", summary(model4_q1)$r.squared, "\n")
Model 4 R-squared: 0.2647377 
Code
cat("Model 4 Adj R-squared:", summary(model4_q1)$adj.r.squared, "\n")
Model 4 Adj R-squared: 0.2641746 

What do station fixed effects capture? Baseline differences in demand across stations (some are just busier than others!).

Model 5: Add Rush Hour Interaction

Code
model5_q1 <- lm(
  Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
    lag1Hour + lag3Hours + lag1day + rush_hour + as.factor(month) +
    Med_Inc + Percent_Taking_Transit + Percent_White +
    as.factor(start_station) +
    rush_hour * weekend,  # Rush hour effects different on weekends
  data = train_q1
)

cat("Model 5 R-squared:", summary(model5_q1)$r.squared, "\n")
Model 5 R-squared: 0.2694846 
Code
cat("Model 5 Adj R-squared:", summary(model5_q1)$adj.r.squared, "\n")
Model 5 Adj R-squared: 0.268919 

Model performance summary (training set)

Code
model_rsq_q1 <- data.frame(
  Model = paste0("Model ", 1:5),
  Description = c("Time + Weather", "+ Temporal Lags", "+ Demographics", 
                  "+ Station Fixed Effects", "+ Rush Hour Interaction"),
  R_squared = c(
    summary(model1_q1)$r.squared,
    summary(model2_q1)$r.squared,
    summary(model3_q1)$r.squared,
    summary(model4_q1)$r.squared,
    summary(model5_q1)$r.squared
  )
)

kable(model_rsq_q1,
      caption = "2025 Q1 Model Performance",
      col.names = c("Model", "Description", "R²"),
      digits = 3) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
2025 Q1 Model Performance
Model   Description               R²
Model 1 Time + Weather 0.080
Model 2 + Temporal Lags 0.233
Model 3 + Demographics 0.237
Model 4 + Station Fixed Effects 0.265
Model 5 + Rush Hour Interaction 0.269

Model Evaluation Summary (test set)

Code
# Create day of week factor with treatment (dummy) coding
test_q1 <- test_q1 %>%
  mutate(dotw_simple = factor(dotw, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))

# Set contrasts to treatment coding (dummy variables)
contrasts(test_q1$dotw_simple) <- contr.treatment(7)

test_q1 <- test_q1 %>%
  mutate(
    pred1 = predict(model1_q1, newdata = test_q1),
    pred2 = predict(model2_q1, newdata = test_q1),
    pred3 = predict(model3_q1, newdata = test_q1),
    pred4 = predict(model4_q1, newdata = test_q1),
    pred5 = predict(model5_q1, newdata = test_q1)
  )

# Calculate MAE for each model
mae_results_q1 <- data.frame(
  Model = paste0("Model ", 1:5),
  Description = c("Time + Weather", "+ Temporal Lags", "+ Demographics", 
                  "+ Station Fixed Effects", "+ Rush Hour Interaction"),
  MAE_q1 = c(
    mean(abs(test_q1$Trip_Count - test_q1$pred1), na.rm = TRUE),
    mean(abs(test_q1$Trip_Count - test_q1$pred2), na.rm = TRUE),
    mean(abs(test_q1$Trip_Count - test_q1$pred3), na.rm = TRUE),
    mean(abs(test_q1$Trip_Count - test_q1$pred4), na.rm = TRUE),
    mean(abs(test_q1$Trip_Count - test_q1$pred5), na.rm = TRUE)
  )
)

kable(mae_results_q1, 
      digits = 2,
      caption = "2025 Q1 Model Evaluation",
      col.names = c("Model", "Description", "MAE")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
2025 Q1 Model Evaluation
Model Description MAE
Model 1 Time + Weather 0.62
Model 2 + Temporal Lags 0.54
Model 3 + Demographics 0.53
Model 4 + Station Fixed Effects 0.53
Model 5 + Rush Hour Interaction 0.53

9.3 Model Performance Comparison

Code
r2_q3_vec <- c(
  summary(model1_q3)$r.squared,
  summary(model2_q3)$r.squared,
  summary(model3_q3)$r.squared,
  summary(model4_q3)$r.squared,
  summary(model5_q3)$r.squared
)

mae_q3_vec <- c(
  mean(abs(test_q3$Trip_Count - test_q3$pred1), na.rm = TRUE),
  mean(abs(test_q3$Trip_Count - test_q3$pred2), na.rm = TRUE),
  mean(abs(test_q3$Trip_Count - test_q3$pred3), na.rm = TRUE),
  mean(abs(test_q3$Trip_Count - test_q3$pred4), na.rm = TRUE),
  mean(abs(test_q3$Trip_Count - test_q3$pred5), na.rm = TRUE)
)

r2_q1_vec <- c(
  summary(model1_q1)$r.squared,
  summary(model2_q1)$r.squared,
  summary(model3_q1)$r.squared,
  summary(model4_q1)$r.squared,
  summary(model5_q1)$r.squared
)

mae_q1_vec <- c(
  mean(abs(test_q1$Trip_Count - test_q1$pred1), na.rm = TRUE),
  mean(abs(test_q1$Trip_Count - test_q1$pred2), na.rm = TRUE),
  mean(abs(test_q1$Trip_Count - test_q1$pred3), na.rm = TRUE),
  mean(abs(test_q1$Trip_Count - test_q1$pred4), na.rm = TRUE),
  mean(abs(test_q1$Trip_Count - test_q1$pred5), na.rm = TRUE)
)

comparison_matrix <- data.frame(
  Model = c("1. Time + Weather", "2. + Temporal Lags", "3. + Demographics", "4. + Station FE", "5. + Rush Hour"),
  R2_Q3_2024 = r2_q3_vec,
  R2_Q1_2025 = r2_q1_vec,
  MAE_Q3_2024 = mae_q3_vec,
  MAE_Q1_2025 = mae_q1_vec
)

kable(comparison_matrix, 
      digits = 3, 
      caption = "Model Performance Comparison: Summer (Q3 2024) vs Winter (Q1 2025)",
      col.names = c("Model Description", "Q3 (Summer)", "Q1 (Winter)", "Q3 (Summer)", "Q1 (Winter)")) %>%
  
  add_header_above(c(" " = 1, "R-Squared (Training Fit)" = 2, "MAE (Testing Error)" = 2)) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F) %>%
  
  row_spec(4, bold = T, color = "black", background = "#e6f3ff")
Model Performance Comparison: Summer (Q3 2024) vs Winter (Q1 2025)
                     R-Squared (Training Fit)    MAE (Testing Error)
Model Description    Q3 (Summer)  Q1 (Winter)    Q3 (Summer)  Q1 (Winter)
1. Time + Weather 0.112 0.080 0.830 0.625
2. + Temporal Lags 0.308 0.233 0.714 0.535
3. + Demographics 0.314 0.237 0.717 0.533
4. + Station FE 0.343 0.265 0.713 0.527
5. + Rush Hour 0.348 0.269 0.714 0.530
Code
ggplot(mae_results_q3, aes(x = reorder(Model, -MAE_q3), y = MAE_q3)) +
  geom_col(fill = "#3182bd", alpha = 0.8) +
  geom_text(aes(label = round(MAE_q3, 2)), vjust = -0.5) +
  labs(
    title = "Model Performance Comparison",
    subtitle = "Lower MAE = Better Predictions",
    x = "Model",
    y = "Mean Absolute Error (trips)"
  ) +
  plotTheme +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Which features gave us the biggest improvement?

  • MAE Value Comparison

    The Winter model achieved a lower mean absolute error than the Summer model, but this numerical edge is largely an artifact of lower ridership volumes rather than better predictive ability. Winter data is heavily zero-inflated: many station-hours record zero trips, so absolute errors are naturally small. Summer's high-volume, dynamic ridership produces larger fluctuations and therefore higher absolute errors, even though the summer model explains more of the variance (R² of 0.343 vs. 0.265).

  • Temporal Patterns Difference

    Temporal patterns differ markedly between the seasons. Summer displays a more robust and predictable structure, driven by a consistent blend of commuting and leisure activity, which allowed the model to explain 34.3% of the variance. Winter patterns are more stochastic and harder to model (R² of 26.5%), likely because cold-weather riding is strictly utilitarian and highly sensitive to immediate weather events rather than the reliable behavioral routines seen in warmer months.

  • The Most Important Features

    In the Summer quarter, temporal lags were unequivocally the most important feature: adding them raised R² from a baseline of 0.112 to 0.308, showing that recent demand history is the single strongest predictor of near-future activity. Demographic variables added negligible value, while station fixed effects were the second most valuable component, raising R² to 0.343 by capturing the intrinsic, static popularity differences between high-traffic hubs and quiet residential stations.


PART 2: Error Analysis

10. Spatial Error Patterns

Code
# Calculate station errors
station_errors <- test_q3 %>%
  filter(!is.na(pred4)) %>%
  group_by(start_station, start_lat, start_lon) %>%
  summarize(
    MAE = mean(abs(Trip_Count - pred4), na.rm = TRUE),
    avg_demand = mean(Trip_Count, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  filter(!is.na(start_lat), !is.na(start_lon))

# Map 1: Prediction Errors
p1 <- ggplot() +
  geom_sf(data = philly_census, fill = "grey95", color = "white", size = 0.2) +
  geom_point(
    data = station_errors,
    aes(x = start_lon, y = start_lat, color = MAE),
    size = 1.5,
    alpha = 0.7
  ) +
  scale_color_viridis(
    option = "plasma",
    name = "MAE\n(trips)",
    direction = -1,
    breaks = c(0.5, 1.0, 1.5),  # Fewer, cleaner breaks
    labels = c("0.5", "1.0", "1.5")
  ) +
  labs(title = "Prediction Errors",
       subtitle = "Higher in Center City") +
  mapTheme +
  theme(
    legend.position = "right",
    legend.title = element_text(size = 10, face = "bold"),
    legend.text = element_text(size = 9),
    plot.title = element_text(size = 14, face = "bold"),
    plot.subtitle = element_text(size = 10)
  ) +
  guides(color = guide_colorbar(
    barwidth = 1.5,
    barheight = 12,
    title.position = "top",
    title.hjust = 0.5
  ))

# Map 2: Average Demand
p2 <- ggplot() +
  geom_sf(data = philly_census, fill = "grey95", color = "white", size = 0.2) +
  geom_point(
    data = station_errors,
    aes(x = start_lon, y = start_lat, color = avg_demand),
    size = 1.5,
    alpha = 0.7
  ) +
  scale_color_viridis(
    option = "viridis",
    name = "Avg\nDemand",
    direction = -1,
    breaks = c(0.5, 1.0, 1.5, 2.0, 2.5),  # Clear breaks
    labels = c("0.5", "1.0", "1.5", "2.0", "2.5")
  ) +
  labs(title = "Average Demand",
       subtitle = "Trips per station-hour") +
  mapTheme +
  theme(
    legend.position = "right",
    legend.title = element_text(size = 10, face = "bold"),
    legend.text = element_text(size = 9),
    plot.title = element_text(size = 14, face = "bold"),
    plot.subtitle = element_text(size = 10)
  ) +
  guides(color = guide_colorbar(
    barwidth = 1.5,
    barheight = 12,
    title.position = "top",
    title.hjust = 0.5
  ))

# Combine
grid.arrange(
  p1, p2,
  ncol = 2
  )

Are prediction errors clustered in certain parts of Philadelphia?

  • High Demand = High Error: The maps show a strong link between demand and error. The areas with the most trips are largely the same areas with the biggest errors.

  • Where are the errors concentrated?

    • University City: This area is home to UPenn and Drexel. Student ridership changes quickly based on class schedules and social events.
    • Center City Area: This is the business and political district. It gets busy with conferences, office workers, and visitors.
  • Why are errors higher in these areas?

    • Diverse User Mix: In quiet neighborhoods, most riders are just commuters. In Center City, there is a mix of tourists, students, delivery workers, and commuters. They all behave differently.
    • Event Spikes: Concerts, sports games, and conventions happen here. These create sudden surges in ridership that our current model does not know about.
    • Station Switching: Stations here are dense. If a station is full, a rider might walk to the next block. This makes the first station look less popular and the second station look more popular than they really are.
    • Abundant Travel Alternatives: These central areas have subways, buses, and ride-shares (Uber/Lyft). If the weather is slightly bad, riders here can easily switch to other transportation. By contrast, in areas with fewer options, riders might stick to biking.

11. Temporal Error Patterns

Code
test_q3 <- test_q3 %>%
  mutate(
    error = Trip_Count - pred4,
    abs_error = abs(error),
    time_of_day = case_when(
      hour < 7 ~ "Overnight",
      hour >= 7 & hour < 10 ~ "AM Rush",
      hour >= 10 & hour < 15 ~ "Mid-Day",
      hour >= 15 & hour <= 18 ~ "PM Rush",
      hour > 18 ~ "Evening"
    )
  )

# Scatter plot by time and day type
ggplot(test_q3, aes(x = Trip_Count, y = pred4)) +
  geom_point(alpha = 0.2, color = "#3182bd") +
  geom_abline(slope = 1, intercept = 0, color = "red", linewidth = 1) +
  geom_smooth(method = "lm", se = FALSE, color = "darkgreen") +
  facet_grid(weekend ~ time_of_day) +
  labs(
    title = "Observed vs. Predicted Bike Trips",
    subtitle = "Model 4 performance by time period",
    x = "Observed Trips",
    y = "Predicted Trips",
    caption = "Red line = perfect predictions; Green line = actual model fit"
  ) +
  plotTheme

Code
# MAE by time of day and day type
temporal_errors <- test_q3 %>%
  group_by(time_of_day, weekend) %>%
  summarize(
    MAE = mean(abs_error, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(day_type = ifelse(weekend == 1, "Weekend", "Weekday"))

ggplot(temporal_errors, aes(x = time_of_day, y = MAE, fill = day_type)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c("Weekday" = "#08519c", "Weekend" = "#6baed6")) +
  labs(
    title = "Prediction Errors by Time Period",
    subtitle = "2024 Q3",
    x = "Time of Day",
    y = "Mean Absolute Error (trips)",
    fill = "Day Type"
  ) +
  plotTheme +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

When are we most wrong?

  • Highest Errors (When we struggle):
    • Weekday PM Rush: This has the absolute highest error. Commutes home are chaotic; people leave work at different times or switch to ride-shares (Uber/Lyft) if they are tired, making demand very hard to predict.
    • Weekday AM Rush: This has the second-highest error. The sheer volume of thousands of commuters trying to ride at the exact same time means even small percentage mistakes result in large errors in total trip counts.
  • Lowest Errors (When we are accurate):
    • Overnight: This is the most accurate period. Ridership is near zero.
    • Weekday Mid-Day: Errors are relatively low here. Unlike rush hour, lunch and errand trips are steady, and stations are rarely full, meaning “unmet demand” does not confuse the model.
  • How does this pattern affect model utility?
    • Hidden “Unmet Demand”: The high error during rush hour suggests that our model is failing to capture “lost sales”, which means people who wanted a bike but found an empty station.
    • Resource Focus: We should not waste time trying to improve “Overnight” predictions since they are already excellent. All future development efforts must focus on adding features (like real-time traffic or transit delays) to fix the “PM Rush” errors.
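
To make the "Resource Focus" point concrete, a minimal sketch (reusing the abs_error and time_of_day columns created in the chunk above) tallies each period's share of the total absolute error:

Code
# Sketch: share of total absolute error contributed by each time period.
# Assumes test_q3 carries abs_error and time_of_day from the previous chunk.
test_q3 %>%
  group_by(time_of_day) %>%
  summarize(total_abs_error = sum(abs_error, na.rm = TRUE), .groups = "drop") %>%
  mutate(share = scales::percent(total_abs_error / sum(total_abs_error))) %>%
  arrange(desc(total_abs_error))

If the two rush periods dominate this table, that supports concentrating future feature work there.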

12. Demographics Error Patterns

Code
# Join demographic data to station errors
station_errors_demo <- station_errors %>%
  left_join(
    stations_census_q3 %>% select(start_station, Med_Inc, Percent_Taking_Transit, Percent_White),
    by = "start_station"
  ) %>%
  filter(!is.na(Med_Inc))

# Create plots
p1 <- ggplot(station_errors_demo, aes(x = Med_Inc, y = MAE)) +
  geom_point(alpha = 0.5, color = "#3182bd") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  scale_x_continuous(labels = scales::dollar) +
  labs(title = "Errors vs. Median Income", x = "Median Income", y = "MAE") +
  plotTheme

p2 <- ggplot(station_errors_demo, aes(x = Percent_Taking_Transit, y = MAE)) +
  geom_point(alpha = 0.5, color = "#3182bd") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Errors vs. Transit Usage", x = "% Taking Transit", y = "MAE") +
  plotTheme

p3 <- ggplot(station_errors_demo, aes(x = Percent_White, y = MAE)) +
  geom_point(alpha = 0.5, color = "#3182bd") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Errors vs. Race", x = "% White", y = "MAE") +
  plotTheme

grid.arrange(p1, p2, p3, ncol = 3)
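
The fitted lines above can be summarized numerically; a quick sketch (using the same station_errors_demo frame built in the previous chunk) reports the correlation between station-level MAE and each demographic variable:

Code
# Sketch: Pearson correlations behind the three scatterplots.
station_errors_demo %>%
  summarize(
    cor_income  = cor(MAE, Med_Inc, use = "complete.obs"),
    cor_transit = cor(MAE, Percent_Taking_Transit, use = "complete.obs"),
    cor_white   = cor(MAE, Percent_White, use = "complete.obs")
  )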

PART 3: Feature Engineering & Model Improvement

13.1 Temporal features: School calendar

13.1.1 Add School Feature

Code
study_panel_complete_q3 <- study_panel_complete_q3 %>%
  mutate(
    # Penn Session Logic
    is_penn_session = case_when(
      date >= "2024-07-01" & date <= "2024-07-03" ~ 1,
      date >= "2024-07-05" & date <= "2024-08-09" ~ 1,
      date >= "2024-08-21" & date <= "2024-09-03" ~ 1,
      date >= "2024-09-05" & date <= "2024-09-30" ~ 1,
      TRUE ~ 0 # All other dates (holidays/breaks) are 0
    ),
    
    # Drexel Session Logic
    is_drexel_session = case_when(
      date >= "2024-07-01" & date <= "2024-07-03" ~ 1,
      date >= "2024-07-06" & date <= "2024-07-27" ~ 1,
      date >= "2024-07-29" & date <= "2024-08-31" ~ 1,
      date >= "2024-09-03" & date <= "2024-09-20" ~ 1,
      TRUE ~ 0 # All other dates (holidays/breaks) are 0
    )
  )

13.1.2 Refresh the Split

Code
# Re-apply the split using the same week logic as Part 8.1
train_q3 <- study_panel_complete_q3 %>% filter(week < 36)
test_q3 <- study_panel_complete_q3 %>% filter(week >= 36)

13.1.3 Build Model 6 (+School features)

Code
train_q3 <- train_q3 %>% 
  mutate(dotw_simple = factor(dotw, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))) 
contrasts(train_q3$dotw_simple) <- contr.treatment(7)

model6_q3 <- lm(
  Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
    lag1Hour + lag3Hours + lag1day +
    Med_Inc + Percent_Taking_Transit + Percent_White +
    as.factor(start_station) +
    is_penn_session + is_drexel_session,
  data = train_q3
)

cat("Model 6 R-squared:", summary(model6_q3)$r.squared, "\n")
Model 6 R-squared: 0.3434728 
Code
cat("Model 6 Adj R-squared:", summary(model6_q3)$adj.r.squared, "\n") 
Model 6 Adj R-squared: 0.3429655 

13.1.4 Evaluate Model 6 (+School features)

Code
# Prepare Test Set Factors
test_q3 <- test_q3 %>% 
  mutate(dotw_simple = factor(dotw, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))
contrasts(test_q3$dotw_simple) <- contr.treatment(7)

# Predict
test_q3$pred_school <- predict(model6_q3, newdata = test_q3)

# Calculate MAE
mae_school <- mean(abs(test_q3$Trip_Count - test_q3$pred_school), na.rm = TRUE)

cat("New Feature MAE:", round(mae_school, 3), "\n")
New Feature MAE: 0.711 
Code
# 0.713 is Model 4's test MAE (our best prior model)
cat("Did the school calendar improve the model?", ifelse(mae_school < 0.713, "YES", "NO"), "\n")
Did the school calendar improve the model? YES 

Why school feature?

  • Motivation: We observed that prediction errors were heavily concentrated in University City and Center City, suggesting that our model was missing key drivers of demand in these student-heavy areas.
  • Feature Engineering: To address this, we integrated the academic calendars of the University of Pennsylvania (UPenn) and Drexel University. We created new binary variables is_penn_session and is_drexel_session to indicate whether a given date was an active “school day” (semester in session) versus a break/holiday.
  • Result: Adding this feature slightly improved the model’s accuracy. Compared to Model 4 (our best-performing baseline), the Mean Absolute Error (MAE) decreased by 0.002 (from 0.713 to 0.711), suggesting that accounting for student population dynamics helps reduce errors in these high-demand zones.
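
Hard-coding each date window works for two schools, but the same flags could be generated from a lookup table if more calendars were added later; a hedged sketch (the interval tibble below simply mirrors the Penn dates in the case_when() above and is not an official calendar source):

Code
# Sketch: build an in-session flag from a table of (start, end) intervals.
penn_sessions <- tibble::tibble(
  start = as.Date(c("2024-07-01", "2024-07-05", "2024-08-21", "2024-09-05")),
  end   = as.Date(c("2024-07-03", "2024-08-09", "2024-09-03", "2024-09-30"))
)

# A date is "in session" if it falls inside any interval
in_session <- function(d, sessions) {
  as.integer(any(d >= sessions$start & d <= sessions$end))
}

# 2024-07-04 falls in a gap (holiday); 2024-07-10 is inside a session window
vapply(as.Date(c("2024-07-04", "2024-07-10")), in_session, integer(1),
       sessions = penn_sessions)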

13.2 Weather features: Weekend + nice weather interaction

13.2.1 Add Weather Interaction

Code
# 1. Define "Nice Weather"
# Criteria: Temperature between 60°F and 75°F, and NO rain (Precipitation == 0)
# We create a binary flag for this.
study_panel_complete_q3 <- study_panel_complete_q3 %>%
  mutate(
    nice_weather = ifelse(Temperature >= 60 & Temperature <= 75 & Precipitation == 0, 1, 0),
    
# 2. Create the Interaction Term
# This specifically targets "Nice Weekends" (Weekend == 1 AND Nice Weather == 1)
    weekend_nice = weekend * nice_weather
  )

13.2.2 Refresh the Split

Code
# Train on weeks < 36, Test on weeks >= 36
train_q3 <- study_panel_complete_q3 %>% filter(week < 36)
test_q3 <- study_panel_complete_q3 %>% filter(week >= 36)

13.2.3 Build Model 7 (+ Weather Interaction)

Code
# Ensure factors are set up (standard prep)
train_q3 <- train_q3 %>% 
  mutate(dotw_simple = factor(dotw, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))
contrasts(train_q3$dotw_simple) <- contr.treatment(7)

# Model 7: Model 4 + Schools + Weather Interaction
model7_q3 <- lm(
  Trip_Count ~ as.factor(hour) + dotw_simple + Precipitation +
    lag1Hour + lag3Hours + lag1day +
    Med_Inc + Percent_Taking_Transit + Percent_White +
    as.factor(start_station) + 
    is_penn_session +    # Keep School Feature
    is_drexel_session +  # Keep School Feature
    weekend_nice,        # <--- NEW INTERACTION FEATURE
  data = train_q3
)

# Output Model Summary
cat("Model 7 R-squared:", summary(model7_q3)$r.squared, "\n")
Model 7 R-squared: 0.3434627 
Code
cat("Model 7 Adj R-squared:", summary(model7_q3)$adj.r.squared, "\n")
Model 7 Adj R-squared: 0.3429554 

13.2.4 Evaluate Model 7

Code
# Prepare Test Set
test_q3 <- test_q3 %>% 
  mutate(dotw_simple = factor(dotw, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))
contrasts(test_q3$dotw_simple) <- contr.treatment(7)

# Predict
test_q3$pred_weather <- predict(model7_q3, newdata = test_q3)

# Calculate MAE
mae_weather <- mean(abs(test_q3$Trip_Count - test_q3$pred_weather), na.rm = TRUE)

cat("MAE (Schools + Weather):", round(mae_weather, 3), "\n")
MAE (Schools + Weather): 0.71 
Code
# 0.711 is Model 6's test MAE (our best prior linear model)
cat("Did the Weekend + nice weather interaction improve the model?", ifelse(mae_weather < 0.711, "YES", "NO"), "\n")
Did the Weekend + nice weather interaction improve the model? YES 

Why weather and weekend interaction?

  • Motivation: We observed that prediction errors varied between weekdays and weekends. Furthermore, high-error zones like Center City have highly developed transit networks with many alternatives (subways, buses, ride-shares). In these areas, rider choice is highly elastic; people will easily switch to other modes if conditions aren’t perfect, making bike demand very sensitive to external factors like weather.
  • Feature Engineering: To capture this sensitivity, we defined a nice_weather variable representing ideal riding conditions: temperatures between 60°F and 75°F with zero precipitation. We then created an interaction term (weekend * nice_weather) to specifically target discretionary leisure trips that occur on sunny weekends.
    • Since the nice_weather variable is directly derived from temperature data, we removed the original Temperature variable from the final model. This was done to avoid multicollinearity, ensuring the model focuses specifically on the impact of “ideal” conditions rather than raw temperature fluctuations.
  • Result: This adjustment led to a small but measurable improvement in accuracy. Compared to Model 4 (our best-performing baseline), the Mean Absolute Error (MAE) decreased by 0.003 (from 0.713 to 0.710), suggesting that explicitly modeling “ideal leisure conditions” captures demand spikes slightly better than raw temperature alone.
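
The multicollinearity rationale can be checked directly on the training panel; a quick sketch (assuming only the columns built in 13.2.1):

Code
# Sketch: how strongly does the derived nice_weather flag track raw
# Temperature, and how many observations does the interaction flag?
with(train_q3, cor(Temperature, nice_weather, use = "complete.obs"))
table(train_q3$weekend, train_q3$nice_weather,
      dnn = c("weekend", "nice_weather"))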

13.3 Attempt for Poisson Model

13.3.1 Build Model 8 (Poisson)

Code
# Fit Poisson Model using the best features from Model 7
# Note: family = "poisson" is the key change here
model8_q3 <- glm(
  Trip_Count ~ as.factor(hour) + dotw_simple + Precipitation +
    lag1Hour + lag3Hours + lag1day +
    Med_Inc + Percent_Taking_Transit + Percent_White +
    as.factor(start_station) + 
    is_penn_session +
    is_drexel_session +  
    weekend_nice,        
  family = "poisson",
  data = train_q3
)

# A glm object has no R-squared; report a deviance-based
# (McFadden-style) pseudo R-squared instead
cat("Model 8 pseudo R-squared:", 1 - model8_q3$deviance / model8_q3$null.deviance, "\n")

13.3.2 Evaluate Model 8

Code
# 1. Predict on Test Data
# type = "response" is crucial for GLM to get the actual count (not log-count)
test_q3$pred_poisson <- predict(model8_q3, newdata = test_q3, type = "response")

# 2. Calculate MAE for Poisson
mae_m8 <- mean(abs(test_q3$Trip_Count - test_q3$pred_poisson), na.rm = TRUE)

# 3. Compare with Best OLS Model (Model 7)
cat("Model 7 (OLS) MAE:", round(mae_weather, 3), "\n")
Model 7 (OLS) MAE: 0.71 
Code
cat("Poisson Model MAE:", round(mae_m8, 3), "\n")
Poisson Model MAE: 0.666 
Code
cat("Did Poisson improve performance?", ifelse(mae_m8 < mae_weather, "YES", "NO"), "\n")
Did Poisson improve performance? YES 
Code
test_q3 <- study_panel_complete_q3 %>% filter(week >= 36)

test_q3 <- test_q3 %>%
  mutate(dotw_simple = factor(dotw, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))
contrasts(test_q3$dotw_simple) <- contr.treatment(7)

# Recalculate the predicted values of all models
preds <- list(
  m1 = predict(model1_q3, newdata = test_q3),
  m2 = predict(model2_q3, newdata = test_q3),
  m3 = predict(model3_q3, newdata = test_q3),
  m4 = predict(model4_q3, newdata = test_q3),
  m5 = predict(model5_q3, newdata = test_q3),
  m6 = predict(model6_q3, newdata = test_q3),
  m7 = predict(model7_q3, newdata = test_q3),
  pois = predict(model8_q3, newdata = test_q3, type = "response") 
)

mae_values <- sapply(preds, function(p) mean(abs(test_q3$Trip_Count - p), na.rm = TRUE))

# create a table
all_models_df <- data.frame(
  Model = c("Model 1", "Model 2", "Model 3", "Model 4", "Model 5", "Model 6", "Model 7", "Model 8"),
  Description = c(
    "Time + Weather (Baseline)",
    "+ Temporal Lags",
    "+ Demographics",
    "+ Station Fixed Effects",
    "+ Rush Hour Interaction",
    "+ School Calendars",
    "+ Weather Interaction",
    "Poisson Regression (Full)"
  ),
  MAE = unname(mae_values)  # unname() keeps the list names out of the row names
)

# Setup baseline(Model 1)
baseline_mae <- all_models_df$MAE[1]
all_models_df$Improvement_vs_M1 <- (baseline_mae - all_models_df$MAE) / baseline_mae

# Setup baseline2(Model2)
lag_mae <- all_models_df$MAE[2]
all_models_df$Improvement_vs_M2 <- (lag_mae - all_models_df$MAE) / lag_mae

all_models_df %>%
  mutate(
    MAE = round(MAE, 3),
    `% Better than M1` = scales::percent(Improvement_vs_M1, accuracy = 0.1),
    `% Better than M2` = scales::percent(Improvement_vs_M2, accuracy = 0.1)
  ) %>%
  select(Model, Description, MAE, `% Better than M1`, `% Better than M2`) %>%
  kable(
    caption = "Part 3: All Models Performance Summary - Q3 2024",
    align = c("l", "l", "c", "c", "c")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = F) %>%
  row_spec(which.min(all_models_df$MAE), bold = T, color = "white", background = "#3182bd")
Part 3: All Models Performance Summary - Q3 2024
Model     Description                  MAE    % Better than M1   % Better than M2
Model 1   Time + Weather (Baseline)    0.830        0.0%             -16.3%
Model 2   + Temporal Lags              0.714       14.0%               0.0%
Model 3   + Demographics               0.717       13.7%              -0.3%
Model 4   + Station Fixed Effects      0.713       14.1%               0.1%
Model 5   + Rush Hour Interaction      0.714       14.0%               0.0%
Model 6   + School Calendars           0.711       14.3%               0.4%
Model 7   + Weather Interaction        0.710       14.5%               0.6%
Model 8   Poisson Regression (Full)    0.666       19.9%               6.8%

Feature Engineering Impact

  • Marginal Improvements: Adding specific domain features yielded small but consistent improvements. Introducing School Calendars (Model 6) reduced the MAE from 0.713 to 0.711, and adding the Weather Interaction (Model 7) further reduced it to 0.710.
  • Cumulative Effect: While individual features like “UPenn Session” or “Nice Weekends” did not drastically shift the global error, they pushed the total improvement over the baseline (Model 1) from 14.1% to 14.5%. This suggests that while these variables capture real behavioral drivers, their signal is relatively subtle compared to the dominant trends.

Why the impact is slight?

  • The “Ceiling Effect” of Lag Variables
    • Lag Redundancy: The reason Models 6 and 7 improved performance only minimally is likely due to the strength of the Temporal Lags (Model 2). The number of trips taken “1 hour ago” or “24 hours ago” already implicitly captures the effects of weather and school schedules. (e.g., If it is raining, the lag1Hour variable is already low, so the model already knows demand is suppressed).
    • Localized vs. Global: Features like School Calendars primarily affect specific zones (University City). When averaging the error across all stations in Philadelphia, the improvement in those specific neighborhoods gets diluted, resulting in a small change in the system-wide MAE.

The Advantage of Poisson Regression

  • Handling Count Data: The most significant leap in performance came not from adding more variables, but from changing the mathematical engine. Model 8 (Poisson) achieved an MAE of 0.666, a massive drop compared to the linear models (0.710).
  • Solving the “Zero” Problem: Linear regression (Models 1-7) can predict negative numbers or decimals (e.g., “-0.5 trips”), which is physically impossible. The Poisson model is designed specifically for non-negative counts. It excels at predicting the many hours where station demand is exactly zero, which constitutes a large portion of the dataset (overnight and low-demand stations).
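
The “zero problem” is easy to demonstrate on synthetic counts (toy data, not Indego trips): an OLS line fitted to a curved count series can dip below zero at the low end, while the Poisson fit, which exponentiates its linear predictor, cannot:

Code
# Sketch: OLS vs Poisson on simulated counts.
set.seed(1)
toy <- data.frame(x = 1:50)
toy$y <- rpois(50, lambda = exp(-3 + 0.12 * toy$x))

fit_lm   <- lm(y ~ x, data = toy)
fit_pois <- glm(y ~ x, family = "poisson", data = toy)

min(predict(fit_lm))                       # typically negative at small x
min(predict(fit_pois, type = "response"))  # strictly positive by construction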

Practical Recommendation

  • Select Model 8: We strongly recommend deploying the Poisson Regression (Model 8). It offers a 19.9% improvement over the baseline and is 6.8% better than the best linear model.
  • Reliability: Beyond just having the lowest error, the Poisson model is operationally safer because it will never predict negative demand. This ensures that rebalancing algorithms receive valid, usable inputs for every station, every hour.

PART 4: Critical Reflection

14.1 Operational implications

14.1.1 Is your final MAE “good enough” for Indego to use?

  • In short, it depends.
  • Yes, for general planning: The final Poisson model achieved an MAE of 0.67 trips per hour. This means, on average, the model is off by less than one bike per hour. This is highly accurate for the majority of the network.
  • No, for peak capacity management: However, an “average” metric hides specific failures. In high-demand zones like Center City and University City, errors can exceed 1.5 trips per hour. During a 3-hour rush window, this creates a cumulative error of ~5 bikes, which is enough to completely empty a small docking station.

14.1.2 When do prediction errors cause problems for rebalancing?

  • The “PM Rush” Danger Zone: The model is least accurate during the Weekday PM Rush, where errors are highest. This is the most dangerous time for errors because:
    • Traffic Constraints: Rebalancing trucks are stuck in the same rush hour traffic as commuters. If the model makes a mistake, the operations team cannot move vehicles fast enough to fix it.
    • Cascading Failures: If the model fails to predict a surge at 5:00 PM, a station goes empty. Riders then walk to the next station, overwhelming that one too, creating a domino effect the model cannot see.

14.1.3 Would you recommend deploying this system? Under what conditions?

  • Recommendation: Yes, we recommend deploying the Poisson Regression (Model 8), but with specific operational rules.
    • Condition 1 - Automated vs. Manual Zones:
      • Automate Equity Zones: The model is very accurate in lower-income and transit-dependent neighborhoods. Indego can automate rebalancing schedules for these areas to ensure reliable access.
      • Monitor High-Volume Zones: For Center City and University City, use the model as a “guide,” but keep human dispatchers monitoring real-time levels during PM Rush.
    • Condition 2: Use Poisson Only: Do not use the linear models. The Poisson model ensures no negative predictions, which prevents the rebalancing software from crashing or suggesting impossible moves (e.g., “remove -2 bikes”).

14.2 Equity considerations

14.2.1 Do prediction errors disproportionately affect certain neighborhoods?

  • Yes, but in an unexpected way: Contrary to many AI systems that fail on marginalized groups, this model is actually least accurate in wealthy, white neighborhoods.
  • The Evidence: There is a clear positive correlation between Median Income and Prediction Error. This means the model makes its biggest mistakes in the richest parts of the city (like Rittenhouse and Center City).
  • The “Equity Zone” Advantage: Conversely, the model performs significantly better (lower error) in lower-income neighborhoods and areas with higher minority populations. This is likely because ridership in these areas is more consistent and less prone to the chaotic spikes seen in tourist/business districts.

14.2.2 Could this system worsen existing disparities in bike access?

  • Even though the model works well for underserved neighborhoods, it creates a dangerous operational incentive. Operations managers might see “Red Alert” (high errors/shortages) constantly flashing in Center City and rush all the effort there to fix it.
  • Resource Drain: If rebalancing teams are too focused on fixing the “unpredictable” problems in wealthy areas, they might pull resources away from lower-income neighborhoods.
  • The Result: Equity zones might suffer gradual neglect, not because the model is wrong, but because the model’s “alarms” are louder elsewhere.

14.2.3 What safeguards would you recommend?

  • Protected Resource Allocation: Establish a “minimum fleet guarantee” for equity neighborhoods. Trucks assigned to these zones cannot be diverted to Center City, no matter how high the demand spikes there.
  • Different Metrics for Different Zones:
    • In Center City: Optimize for Turnover (moving bikes fast).
    • In Equity Zones: Optimize for “Reliability” (ensuring a bike is always there).
  • Human Override: Do not allow the algorithm to fully automate dispatching. A human manager must verify that “chasing demand” in high-income areas is not leaving transit-dependent neighborhoods empty.

14.3 Model limitations

14.3.1 What patterns is your model missing?

  • “Censored” Demand (The biggest missing piece): Our model assumes that Trip Count equals User Demand. This is false. If a station is empty, the trip count is 0, even if 10 people wanted a bike. The model interprets this as “no demand,” failing to capture true interest.
  • Event-Driven Spikes: The model knows where the hotspots are (University City, Center City), but it doesn’t know when special events happen. It misses the massive surges caused by Phillies and Eagles games, concerts, or large conferences that cause the unpredictable errors we saw in the spatial maps.
  • Network Effects: The model treats each station as a lonely island. It doesn’t understand that if Station A is full, riders will overflow to Station B nearby.

14.3.2 What assumptions might not hold in real deployment?

  • The Infinite Capacity Assumption: The model predicts demand as if there are always bikes available. In the real world, physical capacity is limited. A prediction of “10 trips” is useless if the station only holds 5 bikes.
  • The Perfect Weather Forecast Assumption:
    • The Assumption: Our model was trained using historical weather records (e.g., we know for a fact it rained on July 10th). It assumes that the weather data input is 100% accurate.
    • The Reality: In real-time deployment, we have to use weather forecasts (predictions), not historical records. Forecasts are often wrong. If the forecast says “Sunny” but it actually rains, our model will confidently predict high demand, leading to a massive operational error (sending bikes to stations where no one is riding).
  • The Bike Type Indifference Assumption (E-bike Preference):
    • The Assumption: Our model predicts total Trip Count, treating all bikes as equal. It assumes a rider is just as happy to take a standard bike as an electric bike.
    • The Reality: In 2024/2025, riders heavily prefer electric bikes (E-bikes). If a station has 5 bikes but they are all heavy, non-electric classic bikes, users might skip that station and look for an E-bike elsewhere. Our model ignores the composition of the fleet and will over-predict demand for stations that only have classic bikes left.
  • The Perfect Fleet Health Assumption:
    • The Assumption: The model assumes that if the system says a station is “Active,” it is fully functional.
    • The Reality: In reality, vandalism, dead batteries, or flat tires are common. A station might physically have 10 bikes, but if 8 are broken (unavailable to unlock), the actual capacity is only 2. The model will predict high demand based on history, but actual ridership will be near zero because the bikes are unrideable. This “maintenance gap” is completely invisible to our current model.

14.3.3 How would you improve this with more time/data?

  • Add Real-Time Dock Status Data: We would incorporate data on how many minutes per hour a station was empty. This would allow us to estimate “true demand” (e.g., if a station was empty for 30 minutes, we should double the observed trip count to guess the real demand).
  • Scrape Event Calendars: We would build a scraper for local event websites (e.g., Ticketmaster) to create a “Special Event” flag. This would significantly reduce the high errors seen in Center City.
  • Spatial Lag Features: Instead of just looking at one station’s history (lag1Hour), we would add a feature for the history of neighboring stations. This would help the model understand that high demand next door usually means high demand here soon.
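
As a starting point for the spatial-lag idea, a hedged sketch using sf (it assumes a stations_sf point layer keyed by start_station, which would need to be built from the station coordinates used earlier; the 500 m radius is an arbitrary illustration):

Code
# Sketch: list each station's neighbors within 500 m, as a building block
# for a "neighbor lag" feature (e.g., mean lag1Hour among neighbors).
library(sf)

nb <- st_is_within_distance(stations_sf, stations_sf, dist = 500)

neighbor_lookup <- tibble::tibble(
  start_station = stations_sf$start_station,
  neighbors = lapply(seq_along(nb), function(i) {
    stations_sf$start_station[setdiff(nb[[i]], i)]  # drop the self-match
  })
)

A per-hour join of this lookup against the station-hour panel would then yield a neighbor-average lag column for the model.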